[Swift-devel] Swift-issues (PBS+NFS Cluster)

Michael Wilde wilde at mcs.anl.gov
Wed May 6 19:51:54 CDT 2009


Yi, I assume you are testing from communicado to the tp-x001 virtual 
machine?

Does that vm have GT2 GRAM running?

If so, I would first verify that you can do basic globus-job-run and 
globus-url-copy to the vm, and then use a GT2 setting in sites.xml to 
talk to the vm from Swift.
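
Roughly, the checks I have in mind look like the sketch below. The host name
comes from your sites.xml; the "jobmanager-pbs" contact suffix, the test file
paths, and the GT2 pool fragment (the classic jobmanager element form) are my
assumptions, so adjust them to match what is actually installed on the vm.
It also assumes you already have a valid proxy (e.g. from grid-proxy-init).

```shell
#!/bin/sh
# Pre-flight checks to run from the submit host before involving Swift.
# Assumptions (not confirmed in this thread): the GT2 PBS job manager is
# named "jobmanager-pbs" and /tmp is writable on the vm.

GRAM_HOST="${GRAM_HOST:-tp-x001.ci.uchicago.edu}"   # from the quoted sites.xml

# Print the GRAM test command for a given contact string.
gram_cmd() {
    printf 'globus-job-run %s /bin/hostname\n' "$1"
}

# Print the GridFTP test command that copies a small local file to the vm.
gridftp_cmd() {
    printf 'globus-url-copy file:///etc/hosts gsiftp://%s/tmp/swift-gridftp-test\n' "$1"
}

# List the checks in the order they should pass: fork GRAM, PBS GRAM, GridFTP.
list_checks() {
    gram_cmd "$GRAM_HOST"
    gram_cmd "$GRAM_HOST/jobmanager-pbs"
    gridftp_cmd "$GRAM_HOST"
}

# Print a GT2-style pool entry to try in sites.xml once the checks pass.
# Hypothetical fragment: verify the element names against the Swift user guide.
gt2_pool_fragment() {
    cat <<EOF
<pool handle="nb_basecluster">
  <gridftp url="gsiftp://$GRAM_HOST" />
  <jobmanager universe="vanilla" url="$GRAM_HOST/jobmanager-pbs" major="2" />
  <workdirectory>/home/jobrun</workdirectory>
</pool>
EOF
}

# Only execute the commands when explicitly asked (RUN_CHECKS=1);
# otherwise the script just defines the helpers above.
if [ "${RUN_CHECKS:-0}" = "1" ]; then
    list_checks | while read -r cmd; do
        echo "+ $cmd"
        # stdin is redirected so the command cannot swallow the remaining checks
        $cmd </dev/null || echo "FAILED: $cmd"
    done
    gt2_pool_fragment
fi
```

If the fork job manager works but the jobmanager-pbs one hangs or fails, the
problem is between GRAM and PBS on the vm rather than in Swift.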

- Mike

ps. I have not been following the list for the past week, so I need to 
review it and see what your plan was, and what Ben and others 
recommended. I'll try to catch up on that soon, and then we should meet 
at the CI and discuss.


On 5/6/09 7:24 PM, yizhu wrote:
> Hi all
> 
> I tried running swift-0.8 on a Nimbus cloud cluster (PBS+NFS), and configured 
> sites.xml and tc.data accordingly.[1]
> 
> When I ran "swift first.swift", it got stuck in the "Submitting:1" 
> phase[2] (it kept printing "Progress:  Submitting:1" and never 
> returned).
> 
> Then I ssh'ed to the pbs_server to check the server log[3] and found that 
> the job had been enqueued, run, and successfully dequeued. I also checked 
> the queue status[4] while the job was running and found that the 
> output_path is "/dev/null", which I did not expect. (The working directory 
> of Swift is "/home/jobrun".)
> 
> I think the problem might be that the pbs_server failed to return the 
> output to the correct path (by the way, what is the output_path supposed 
> to be: the same as Swift's work directory?). Or does anyone have a better 
> idea?
> 
> 
> Many Thanks.
> 
> 
> -Yi
> 
> 
> 
> --------------------------------------------------
> [1]----sites.xml
> 
>  <pool handle="nb_basecluster">
>     <gridftp url="gsiftp://tp-x001.ci.uchicago.edu" />
>     <execution
>        url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService"
>        jobManager="PBS" provider="gt4" />
>     <workdirectory>/home/jobrun</workdirectory>
>   </pool>
> 
> ----tc.data
> 
> nb_basecluster  echo   /bin/echo   INSTALLED  INTEL32::LINUX  null
> nb_basecluster  cat    /bin/cat    INSTALLED  INTEL32::LINUX  null
> nb_basecluster  ls     /bin/ls     INSTALLED  INTEL32::LINUX  null
> nb_basecluster  grep   /bin/grep   INSTALLED  INTEL32::LINUX  null
> nb_basecluster  sort   /bin/sort   INSTALLED  INTEL32::LINUX  null
> nb_basecluster  paste  /bin/paste  INSTALLED  INTEL32::LINUX  null
> 
> ---------------------------------------------------------
> [2]
> yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift
> Recompilation suppressed.
> Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml
> Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data
> Setting resources to: {nb_basecluster=nb_basecluster}
> Swift 0.8 swift-r2448 cog-r2261
> 
> Swift 0.8 swift-r2448 cog-r2261
> 
> RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912-zqd8t5hg
> RunID: 20090506-1912-zqd8t5hg
> closed org.griphyn.vdl.mapping.RootDataNode identifier 
> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
> type string value=Hello, world! dataset=unnamed SwiftScript value (closed)
> ROOTPATH 
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
> path=$
> VALUE 
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
> VALUE=Hello, world!
> closed org.griphyn.vdl.mapping.RootDataNode identifier 
> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
> type string value=Hello, world! dataset=unnamed SwiftScript value (closed)
> ROOTPATH 
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
> path=$
> VALUE 
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
> VALUE=Hello, world!
> NEW 
> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
> 
> Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier 
> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 
> type messagefile with no value at dataset=outfile (not closed).$
> NEW 
> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 
> 
> Progress:
> PROCEDURE thread=0 name=greeting
> PARAM thread=0 direction=output variable=t 
> provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 
> 
> closed org.griphyn.vdl.mapping.RootDataNode identifier 
> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
> type string value=hello.txt dataset=unnamed SwiftScript value (closed)
> ROOTPATH 
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
> path=$
> VALUE 
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
> VALUE=hello.txt
> closed org.griphyn.vdl.mapping.RootDataNode identifier 
> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
> type string value=hello.txt dataset=unnamed SwiftScript value (closed)
> ROOTPATH 
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
> path=$
> VALUE 
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
> VALUE=hello.txt
> START thread=0 tr=echo
> Sorted: [nb_basecluster:0.000(1.000):0/1 overload: 0]
> Rand: 0.6583597597672994, sum: 1.0
> Next contact: nb_basecluster:0.000(1.000):0/1 overload: 0
> Progress:  Initializing site shared directory:1
> START host=nb_basecluster - Initializing shared directory
> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, -0.01)
> Old score: 0.000, new score: -0.010
> No global submit throttle set. Using default (100)
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status 
> to Submitting
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status 
> to Submitted
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status 
> to Active
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status 
> to Completed
> multiplyScore(nb_basecluster:-0.010(0.994):1/1 overload: 0, 0.01)
> Old score: -0.010, new score: 0.000
> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.1)
> Old score: 0.000, new score: 0.100
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) Completed. 
> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 63M, Max heap: 720M
> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, -0.2)
> Old score: 0.100, new score: -0.100
> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status 
> to Submitting
> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status 
> to Submitted
> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status 
> to Active
> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status 
> to Completed
> multiplyScore(nb_basecluster:-0.100(0.943):1/1 overload: 0, 0.2)
> Old score: -0.100, new score: 0.100
> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, 0.1)
> Old score: 0.100, new score: 0.200
> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) Completed. 
> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 60M, Max heap: 720M
> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, -0.2)
> Old score: 0.200, new score: 0.000
> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status 
> to Submitting
> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status 
> to Submitted
> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status 
> to Active
> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status 
> to Completed
> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.2)
> Old score: 0.000, new score: 0.200
> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, 0.1)
> Old score: 0.200, new score: 0.300
> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) Completed. 
> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, -0.01)
> Old score: 0.300, new score: 0.290
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status 
> to Submitting
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status 
> to Submitted
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status 
> to Active
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status 
> to Completed
> multiplyScore(nb_basecluster:0.290(1.185):1/1 overload: 0, 0.01)
> Old score: 0.290, new score: 0.300
> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, 0.1)
> Old score: 0.300, new score: 0.400
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) Completed. 
> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, -0.01)
> Old score: 0.400, new score: 0.390
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status 
> to Submitting
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status 
> to Submitted
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status 
> to Active
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status 
> to Completed
> multiplyScore(nb_basecluster:0.390(1.256):1/1 overload: 0, 0.01)
> Old score: 0.390, new score: 0.400
> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, 0.1)
> Old score: 0.400, new score: 0.500
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) Completed. 
> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, -0.01)
> Old score: 0.500, new score: 0.490
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status 
> to Submitting
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status 
> to Submitted
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status 
> to Active
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status 
> to Completed
> multiplyScore(nb_basecluster:0.490(1.332):1/1 overload: 0, 0.01)
> Old score: 0.490, new score: 0.500
> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, 0.1)
> Old score: 0.500, new score: 0.600
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) Completed. 
> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
> END host=nb_basecluster - Done initializing shared directory
> THREAD_ASSOCIATION jobid=echo-kfecpfaj thread=0-1 host=nb_basecluster 
> replicationGroup=jfecpfaj
> Progress:  Stage in:1
> START jobid=echo-kfecpfaj host=nb_basecluster - Initializing directory 
> structure
> START path= dir=first-20090506-1912-zqd8t5hg/shared - Creating directory 
> structure
> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, -0.01)
> Old score: 0.600, new score: 0.590
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status 
> to Submitting
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status 
> to Submitted
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status 
> to Active
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status 
> to Completed
> multiplyScore(nb_basecluster:0.590(1.411):1/1 overload: 0, 0.01)
> Old score: 0.590, new score: 0.600
> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, 0.1)
> Old score: 0.600, new score: 0.700
> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) Completed. 
> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
> END jobid=echo-kfecpfaj - Done initializing directory structure
> START jobid=echo-kfecpfaj - Staging in files
> END jobid=echo-kfecpfaj - Staging in finished
> JOB_START jobid=echo-kfecpfaj tr=echo arguments=[Hello, world!] 
> tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj 
> host=nb_basecluster
> jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION, 
> identity=urn:0-1-1241655174950)
> multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2)
> Old score: 0.700, new score: 0.500
> Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) setting status 
> to Submitting
> Submitting task: Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950)
> <startTime name="submission">1241655180260</startTime>
> <startTime name="createManagedJob">1241655180623</startTime>
> <endTime name="createManagedJob">1241655181975</endTime
> Task submitted: Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950)
> Progress:  Submitting:1
> 
> Progress:  Submitting:1
> 
> Progress:  Submitting:1
> Progress:  Submitting:1
> Progress:  Submitting:1
> Progress:  Submitting:1
> ^C
> yizhu at ubuntu:~/swift-0.8/examples/swift$
> 
> --------------------------------------------------------------
> [3] see attachment
> -------------------------------------------------------------
> [4]
> tp-x001 torque # qstat -f
> Job Id: 3.tp-x001.ci.uchicago.edu
>     Job_Name = STDIN
>     Job_Owner = jobrun at tp-x001.ci.uchicago.edu
>     job_state = R
>     queue = batch
>     server = tp-x001.ci.uchicago.edu
>     Checkpoint = u
>     ctime = Wed May  6 19:22:10 2009
>     Error_Path = tp-x001.ci.uchicago.edu:/dev/null
>     exec_host = tp-x002/0
>     Hold_Types = n
>     Join_Path = n
>     Keep_Files = n
>     Mail_Points = n
>     mtime = Wed May  6 19:22:10 2009
>     Output_Path = tp-x001.ci.uchicago.edu:/dev/null
>     Priority = 0
>     qtime = Wed May  6 19:22:10 2009
>     Rerunable = True
>     Resource_List.neednodes = 1
>     Resource_List.nodect = 1
>     Resource_List.nodes = 1
>     Shell_Path_List = /bin/sh
>     substate = 40
>     Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,
>     PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sb
>     in:/opt/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=tp-x001.ci.uchicago.edu,
>     PBS_O_HOST=tp-x001.ci.uchicago.edu,
>     PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54,
>     PBS_O_QUEUE=batch
>     euser = jobrun
>     egroup = users
>     hashname = 3.tp-x001.c
>     queue_rank = 2
>     queue_type = E
>     comment = Job started on Wed May 06 at 19:22
>     etime = Wed May  6 19:22:10 2009
> 
> tp-x001 torque #
> 
> 


