[Swift-devel] Swift-issues (PBS+NFS Cluster)
Michael Wilde
wilde at mcs.anl.gov
Wed May 6 22:48:44 CDT 2009
Very good!
Now, what kind of tests can you do next?
Can you exercise the cluster with an interesting workflow?
How large a cluster can you assemble in a Nimbus workspace?
Can you aggregate VMs from a few different physical clusters into one
Nimbus workspace?
What's the largest cluster you can assemble with Nimbus?
Can we try putting Matlab on a Nimbus VM and then running it at an
interesting scale?
Do we have any free allocations of EC2 to enable testing of this at Amazon?
- Mike
On 5/6/09 10:10 PM, yizhu wrote:
> problem solved.
>
> I migrated all of my work from my laptop to the ci.uchicago.edu server,
> and then everything worked.
>
>
> -Yi
> yizhu wrote:
>> Hi all
>>
>> I tried running swift-0.8 over Nimbus Cloud (PBS+NFS), and configured
>> sites.xml and tc.data accordingly.[1]
>>
>> When I tried to run "$ swift first.swift", it got stuck in the
>> "Submitting:1" phase[2] (it kept printing "Progress: Submitting:1" and
>> never returned).
>>
>> Then I ssh'ed to the pbs_server to check the server log[3] and found that
>> the job had been enqueued, run, and successfully dequeued. I also checked
>> the queue status[4] while the job was running and found that the
>> output_path is "/dev/null", which I did not expect. (The working
>> directory of swift is "/home/jobrun".)
>>
>> I think the problem might be that the pbs_server failed to return the
>> output to the correct path (btw, what is the output_path supposed to be:
>> the same as the work directory of swift?). Or does anyone have a better
>> idea?
>>
>>
>> Many Thanks.
>>
>>
>> -Yi
>>
>>
>>
>> --------------------------------------------------
>> [1]----sites.xml
>>
>> <pool handle="nb_basecluster">
>> <gridftp url="gsiftp://tp-x001.ci.uchicago.edu" />
>> <execution
>> url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService"
>> jobManager="PBS" provider="gt4" />
>> <workdirectory >/home/jobrun</workdirectory>
>> </pool>
>>
>> ----tc.data
>>
>> nb_basecluster echo /bin/echo INSTALLED INTEL32::LINUX null
>> nb_basecluster cat /bin/cat INSTALLED INTEL32::LINUX null
>> nb_basecluster ls /bin/ls INSTALLED INTEL32::LINUX null
>> nb_basecluster grep /bin/grep INSTALLED INTEL32::LINUX null
>> nb_basecluster sort /bin/sort INSTALLED INTEL32::LINUX null
>> nb_basecluster paste /bin/paste INSTALLED INTEL32::LINUX null
>>
>> ---------------------------------------------------------
>> [2]
>> yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift
>> Recompilation suppressed.
>> Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml
>> Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data
>> Setting resources to: {nb_basecluster=nb_basecluster}
>> Swift 0.8 swift-r2448 cog-r2261
>>
>> Swift 0.8 swift-r2448 cog-r2261
>>
>> RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912-zqd8t5hg
>> RunID: 20090506-1912-zqd8t5hg
>> closed org.griphyn.vdl.mapping.RootDataNode identifier
>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>> type string value=Hello, world! dataset=unnamed SwiftScript value
>> (closed)
>> ROOTPATH
>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>> path=$
>> VALUE
>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>> VALUE=Hello, world!
>> closed org.griphyn.vdl.mapping.RootDataNode identifier
>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>> type string value=Hello, world! dataset=unnamed SwiftScript value
>> (closed)
>> ROOTPATH
>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>> path=$
>> VALUE
>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>> VALUE=Hello, world!
>> NEW
>> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>
>> Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier
>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
>> type messagefile with no value at dataset=outfile (not closed).$
>> NEW
>> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
>>
>> Progress:
>> PROCEDURE thread=0 name=greeting
>> PARAM thread=0 direction=output variable=t
>> provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
>>
>> closed org.griphyn.vdl.mapping.RootDataNode identifier
>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>> type string value=hello.txt dataset=unnamed SwiftScript value (closed)
>> ROOTPATH
>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>> path=$
>> VALUE
>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>> VALUE=hello.txt
>> closed org.griphyn.vdl.mapping.RootDataNode identifier
>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>> type string value=hello.txt dataset=unnamed SwiftScript value (closed)
>> ROOTPATH
>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>> path=$
>> VALUE
>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>> VALUE=hello.txt
>> START thread=0 tr=echo
>> Sorted: [nb_basecluster:0.000(1.000):0/1 overload: 0]
>> Rand: 0.6583597597672994, sum: 1.0
>> Next contact: nb_basecluster:0.000(1.000):0/1 overload: 0
>> Progress: Initializing site shared directory:1
>> START host=nb_basecluster - Initializing shared directory
>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, -0.01)
>> Old score: 0.000, new score: -0.010
>> No global submit throttle set. Using default (100)
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting
>> status to Submitting
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting
>> status to Submitted
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting
>> status to Active
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting
>> status to Completed
>> multiplyScore(nb_basecluster:-0.010(0.994):1/1 overload: 0, 0.01)
>> Old score: -0.010, new score: 0.000
>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.1)
>> Old score: 0.000, new score: 0.100
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) Completed.
>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 63M, Max heap: 720M
>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, -0.2)
>> Old score: 0.100, new score: -0.100
>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>> status to Submitting
>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>> status to Submitted
>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>> status to Active
>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>> status to Completed
>> multiplyScore(nb_basecluster:-0.100(0.943):1/1 overload: 0, 0.2)
>> Old score: -0.100, new score: 0.100
>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, 0.1)
>> Old score: 0.100, new score: 0.200
>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) Completed.
>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 60M, Max heap: 720M
>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, -0.2)
>> Old score: 0.200, new score: 0.000
>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>> status to Submitting
>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>> status to Submitted
>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>> status to Active
>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>> status to Completed
>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.2)
>> Old score: 0.000, new score: 0.200
>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, 0.1)
>> Old score: 0.200, new score: 0.300
>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) Completed.
>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, -0.01)
>> Old score: 0.300, new score: 0.290
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting
>> status to Submitting
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting
>> status to Submitted
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting
>> status to Active
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting
>> status to Completed
>> multiplyScore(nb_basecluster:0.290(1.185):1/1 overload: 0, 0.01)
>> Old score: 0.290, new score: 0.300
>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, 0.1)
>> Old score: 0.300, new score: 0.400
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) Completed.
>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, -0.01)
>> Old score: 0.400, new score: 0.390
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting
>> status to Submitting
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting
>> status to Submitted
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting
>> status to Active
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting
>> status to Completed
>> multiplyScore(nb_basecluster:0.390(1.256):1/1 overload: 0, 0.01)
>> Old score: 0.390, new score: 0.400
>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, 0.1)
>> Old score: 0.400, new score: 0.500
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) Completed.
>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, -0.01)
>> Old score: 0.500, new score: 0.490
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting
>> status to Submitting
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting
>> status to Submitted
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting
>> status to Active
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting
>> status to Completed
>> multiplyScore(nb_basecluster:0.490(1.332):1/1 overload: 0, 0.01)
>> Old score: 0.490, new score: 0.500
>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, 0.1)
>> Old score: 0.500, new score: 0.600
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) Completed.
>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>> END host=nb_basecluster - Done initializing shared directory
>> THREAD_ASSOCIATION jobid=echo-kfecpfaj thread=0-1 host=nb_basecluster
>> replicationGroup=jfecpfaj
>> Progress: Stage in:1
>> START jobid=echo-kfecpfaj host=nb_basecluster - Initializing directory
>> structure
>> START path= dir=first-20090506-1912-zqd8t5hg/shared - Creating
>> directory structure
>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, -0.01)
>> Old score: 0.600, new score: 0.590
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting
>> status to Submitting
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting
>> status to Submitted
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting
>> status to Active
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting
>> status to Completed
>> multiplyScore(nb_basecluster:0.590(1.411):1/1 overload: 0, 0.01)
>> Old score: 0.590, new score: 0.600
>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, 0.1)
>> Old score: 0.600, new score: 0.700
>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) Completed.
>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>> END jobid=echo-kfecpfaj - Done initializing directory structure
>> START jobid=echo-kfecpfaj - Staging in files
>> END jobid=echo-kfecpfaj - Staging in finished
>> JOB_START jobid=echo-kfecpfaj tr=echo arguments=[Hello, world!]
>> tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj
>> host=nb_basecluster
>> jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION,
>> identity=urn:0-1-1241655174950)
>> multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2)
>> Old score: 0.700, new score: 0.500
>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) setting
>> status to Submitting
>> Submitting task: Task(type=JOB_SUBMISSION,
>> identity=urn:0-1-1241655174950)
>> <startTime name="submission">1241655180260</startTime>
>> <startTime name="createManagedJob">1241655180623</startTime>
>> <endTime name="createManagedJob">1241655181975</endTime>
>> Task submitted: Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950)
>> Progress: Submitting:1
>>
>> Progress: Submitting:1
>>
>> Progress: Submitting:1
>> Progress: Submitting:1
>> Progress: Submitting:1
>> Progress: Submitting:1
>> ^C
>> yizhu at ubuntu:~/swift-0.8/examples/swift$
>>
>> --------------------------------------------------------------
>> [3] see attachment
>> -------------------------------------------------------------
>> [4]
>> tp-x001 torque # qstat -f
>> Job Id: 3.tp-x001.ci.uchicago.edu
>> Job_Name = STDIN
>> Job_Owner = jobrun at tp-x001.ci.uchicago.edu
>> job_state = R
>> queue = batch
>> server = tp-x001.ci.uchicago.edu
>> Checkpoint = u
>> ctime = Wed May 6 19:22:10 2009
>> Error_Path = tp-x001.ci.uchicago.edu:/dev/null
>> exec_host = tp-x002/0
>> Hold_Types = n
>> Join_Path = n
>> Keep_Files = n
>> Mail_Points = n
>> mtime = Wed May 6 19:22:10 2009
>> Output_Path = tp-x001.ci.uchicago.edu:/dev/null
>> Priority = 0
>> qtime = Wed May 6 19:22:10 2009
>> Rerunable = True
>> Resource_List.neednodes = 1
>> Resource_List.nodect = 1
>> Resource_List.nodes = 1
>> Shell_Path_List = /bin/sh
>> substate = 40
>> Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,
>> PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sb
>> in:/opt/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=tp-x001.ci.uchicago.edu,
>> PBS_O_HOST=tp-x001.ci.uchicago.edu,
>> PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54,
>> PBS_O_QUEUE=batch
>> euser = jobrun
>> egroup = users
>> hashname = 3.tp-x001.c
>> queue_rank = 2
>> queue_type = E
>> comment = Job started on Wed May 06 at 19:22
>> etime = Wed May 6 19:22:10 2009
>>
>> tp-x001 torque #
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
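
A note for anyone retracing the qstat check in [4]: the sketch below (Python, not part of the original thread) parses `qstat -f` style text and lists jobs whose Output_Path or Error_Path points at /dev/null. The function names parse_qstat_full and discarded_streams are hypothetical, the sample text is abridged from the listing in [4], and treating every line without a " = " separator as a continuation of the previous attribute is a simplifying assumption, not a description of Torque's exact output format. The thread does not establish that the /dev/null paths caused the hang; the check only makes that observation easy to repeat.

```python
import re

# Abridged from the `qstat -f` listing in [4]; shell prompt lines are left out.
SAMPLE = """\
Job Id: 3.tp-x001.ci.uchicago.edu
    Job_Name = STDIN
    job_state = R
    Error_Path = tp-x001.ci.uchicago.edu:/dev/null
    Output_Path = tp-x001.ci.uchicago.edu:/dev/null
    Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,
        PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54,
        PBS_O_QUEUE=batch
"""

# An attribute line looks like "name = value" (spaces around the "=").
ATTRIBUTE = re.compile(r"^([A-Za-z_][\w.]*) = (.*)$")


def parse_qstat_full(text):
    """Return {job_id: {attribute: value}} for `qstat -f` style text."""
    jobs = {}
    attrs = None      # attribute dict of the job currently being read
    last_key = None   # most recent attribute, for wrapped values
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Job Id:"):
            job_id = line.split(":", 1)[1].strip()
            attrs = jobs.setdefault(job_id, {})
            last_key = None
            continue
        if attrs is None:
            continue  # ignore anything before the first job header
        match = ATTRIBUTE.match(line)
        if match:
            last_key = match.group(1)
            attrs[last_key] = match.group(2)
        elif last_key is not None:
            # Assumption: a line without " = " continues the previous value.
            attrs[last_key] += line
    return jobs


def discarded_streams(jobs):
    """List (job_id, attribute) pairs whose path is /dev/null."""
    found = []
    for job_id, attrs in jobs.items():
        for key in ("Output_Path", "Error_Path"):
            # Paths are reported as "host:/path"; keep only the path part.
            path = attrs.get(key, "").rsplit(":", 1)[-1]
            if path == "/dev/null":
                found.append((job_id, key))
    return found


if __name__ == "__main__":
    for job_id, key in discarded_streams(parse_qstat_full(SAMPLE)):
        print(job_id, key, "points at /dev/null")
```

To use it against a live queue, the text could come from running `qstat -f` on the PBS server and passing the captured output to parse_qstat_full.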