[Swift-devel] Swift-issues (PBS+NFS Cluster)

yizhu yizhu at cs.uchicago.edu
Wed May 6 19:24:10 CDT 2009


Hi all

I tried running swift-0.8 on a Nimbus cloud cluster (PBS+NFS), and
configured sites.xml and tc.data accordingly [1].

When I ran "swift first.swift", it got stuck in the "Submitting:1"
phase [2]: it kept printing "Progress:  Submitting:1" and never returned.

I then ssh'ed to the pbs_server host to check the server log [3] and
found that the job had been enqueued, run, and successfully dequeued. I
also checked the queue status [4] while the job was running and found
that Output_Path is "/dev/null", which I did not expect. (The Swift
working directory is "/home/jobrun".)

I think the problem might be that pbs_server failed to return the output
to the correct path (by the way, what is the output path supposed to be:
the same as the Swift work directory?). Or does anyone have a better idea?


Many Thanks.


-Yi



--------------------------------------------------
[1]----sites.xml

  <pool handle="nb_basecluster">
    <gridftp url="gsiftp://tp-x001.ci.uchicago.edu" />
    <execution url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService"
               jobManager="PBS" provider="gt4" />
    <workdirectory>/home/jobrun</workdirectory>
  </pool>

----tc.data

nb_basecluster 	echo 		/bin/echo	INSTALLED	INTEL32::LINUX	null
nb_basecluster 	cat 		/bin/cat	INSTALLED	INTEL32::LINUX	null
nb_basecluster 	ls 		/bin/ls		INSTALLED	INTEL32::LINUX	null
nb_basecluster 	grep 		/bin/grep	INSTALLED	INTEL32::LINUX	null
nb_basecluster 	sort 		/bin/sort	INSTALLED	INTEL32::LINUX	null
nb_basecluster 	paste 		/bin/paste	INSTALLED	INTEL32::LINUX	null
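
In case it helps anyone checking a similar setup: a minimal sketch
(my own helper, not part of Swift) that checks every site name in the
first column of tc.data has a matching pool handle in sites.xml. The
function names and the inline sample strings are assumptions for
illustration; a real run would read the two files from the etc/
directory instead.

```python
import xml.etree.ElementTree as ET


def pool_handles(sites_xml):
    """Return the set of handle attributes of all <pool> elements."""
    root = ET.fromstring(sites_xml)
    handles = set()
    for el in root.iter():
        # Ignore any XML namespace prefix on the tag name.
        if el.tag.rsplit("}", 1)[-1] == "pool":
            handles.add(el.get("handle"))
    return handles


def tc_sites(tc_data):
    """Return the set of site names (first column) used in tc.data."""
    sites = set()
    for line in tc_data.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sites.add(line.split()[0])
    return sites


# Hypothetical inline copies of the two files shown above.
SITES_XML = """
<pool handle="nb_basecluster">
  <gridftp url="gsiftp://tp-x001.ci.uchicago.edu" />
  <workdirectory>/home/jobrun</workdirectory>
</pool>
"""

TC_DATA = (
    "nb_basecluster\techo\t/bin/echo\tINSTALLED\tINTEL32::LINUX\tnull\n"
    "nb_basecluster\tcat\t/bin/cat\tINSTALLED\tINTEL32::LINUX\tnull\n"
)

missing = tc_sites(TC_DATA) - pool_handles(SITES_XML)
print("sites in tc.data without a pool:", sorted(missing))
```

An empty list here only says the names line up; it says nothing about
whether the URLs or the job manager setting are right.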

---------------------------------------------------------
[2]
yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift
Recompilation suppressed.
Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml
Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data
Setting resources to: {nb_basecluster=nb_basecluster}
Swift 0.8 swift-r2448 cog-r2261

Swift 0.8 swift-r2448 cog-r2261

RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912-zqd8t5hg
RunID: 20090506-1912-zqd8t5hg
closed org.griphyn.vdl.mapping.RootDataNode identifier 
tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
type string value=Hello, world! dataset=unnamed SwiftScript value (closed)
ROOTPATH 
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
path=$
VALUE 
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
VALUE=Hello, world!
closed org.griphyn.vdl.mapping.RootDataNode identifier 
tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
type string value=Hello, world! dataset=unnamed SwiftScript value (closed)
ROOTPATH 
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
path=$
VALUE 
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
VALUE=Hello, world!
NEW 
id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier 
tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 
type messagefile with no value at dataset=outfile (not closed).$
NEW 
id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
Progress:
PROCEDURE thread=0 name=greeting
PARAM thread=0 direction=output variable=t 
provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
closed org.griphyn.vdl.mapping.RootDataNode identifier 
tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
type string value=hello.txt dataset=unnamed SwiftScript value (closed)
ROOTPATH 
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
path=$
VALUE 
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
VALUE=hello.txt
closed org.griphyn.vdl.mapping.RootDataNode identifier 
tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
type string value=hello.txt dataset=unnamed SwiftScript value (closed)
ROOTPATH 
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
path=$
VALUE 
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
VALUE=hello.txt
START thread=0 tr=echo
Sorted: [nb_basecluster:0.000(1.000):0/1 overload: 0]
Rand: 0.6583597597672994, sum: 1.0
Next contact: nb_basecluster:0.000(1.000):0/1 overload: 0
Progress:  Initializing site shared directory:1
START host=nb_basecluster - Initializing shared directory
multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, -0.01)
Old score: 0.000, new score: -0.010
No global submit throttle set. Using default (100)
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status 
to Submitting
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status 
to Submitted
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status 
to Active
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status 
to Completed
multiplyScore(nb_basecluster:-0.010(0.994):1/1 overload: 0, 0.01)
Old score: -0.010, new score: 0.000
multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.1)
Old score: 0.000, new score: 0.100
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) Completed. 
Waiting: 0, Running: 0. Heap size: 75M, Heap free: 63M, Max heap: 720M
multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, -0.2)
Old score: 0.100, new score: -0.100
Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status 
to Submitting
Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status 
to Submitted
Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status 
to Active
Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status 
to Completed
multiplyScore(nb_basecluster:-0.100(0.943):1/1 overload: 0, 0.2)
Old score: -0.100, new score: 0.100
multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, 0.1)
Old score: 0.100, new score: 0.200
Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) Completed. 
Waiting: 0, Running: 0. Heap size: 75M, Heap free: 60M, Max heap: 720M
multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, -0.2)
Old score: 0.200, new score: 0.000
Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status 
to Submitting
Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status 
to Submitted
Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status 
to Active
Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status 
to Completed
multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.2)
Old score: 0.000, new score: 0.200
multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, 0.1)
Old score: 0.200, new score: 0.300
Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) Completed. 
Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, -0.01)
Old score: 0.300, new score: 0.290
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status 
to Submitting
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status 
to Submitted
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status 
to Active
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status 
to Completed
multiplyScore(nb_basecluster:0.290(1.185):1/1 overload: 0, 0.01)
Old score: 0.290, new score: 0.300
multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, 0.1)
Old score: 0.300, new score: 0.400
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) Completed. 
Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, -0.01)
Old score: 0.400, new score: 0.390
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status 
to Submitting
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status 
to Submitted
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status 
to Active
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status 
to Completed
multiplyScore(nb_basecluster:0.390(1.256):1/1 overload: 0, 0.01)
Old score: 0.390, new score: 0.400
multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, 0.1)
Old score: 0.400, new score: 0.500
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) Completed. 
Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, -0.01)
Old score: 0.500, new score: 0.490
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status 
to Submitting
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status 
to Submitted
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status 
to Active
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status 
to Completed
multiplyScore(nb_basecluster:0.490(1.332):1/1 overload: 0, 0.01)
Old score: 0.490, new score: 0.500
multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, 0.1)
Old score: 0.500, new score: 0.600
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) Completed. 
Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
END host=nb_basecluster - Done initializing shared directory
THREAD_ASSOCIATION jobid=echo-kfecpfaj thread=0-1 host=nb_basecluster 
replicationGroup=jfecpfaj
Progress:  Stage in:1
START jobid=echo-kfecpfaj host=nb_basecluster - Initializing directory 
structure
START path= dir=first-20090506-1912-zqd8t5hg/shared - Creating directory 
structure
multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, -0.01)
Old score: 0.600, new score: 0.590
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status 
to Submitting
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status 
to Submitted
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status 
to Active
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status 
to Completed
multiplyScore(nb_basecluster:0.590(1.411):1/1 overload: 0, 0.01)
Old score: 0.590, new score: 0.600
multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, 0.1)
Old score: 0.600, new score: 0.700
Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) Completed. 
Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
END jobid=echo-kfecpfaj - Done initializing directory structure
START jobid=echo-kfecpfaj - Staging in files
END jobid=echo-kfecpfaj - Staging in finished
JOB_START jobid=echo-kfecpfaj tr=echo arguments=[Hello, world!] 
tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj host=nb_basecluster
jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION, 
identity=urn:0-1-1241655174950)
multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2)
Old score: 0.700, new score: 0.500
Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) setting status 
to Submitting
Submitting task: Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950)
<startTime name="submission">1241655180260</startTime>
<startTime name="createManagedJob">1241655180623</startTime>
<endTime name="createManagedJob">1241655181975</endTime>
Task submitted: Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950)
Progress:  Submitting:1

Progress:  Submitting:1

Progress:  Submitting:1
Progress:  Submitting:1
Progress:  Submitting:1
Progress:  Submitting:1
^C
yizhu at ubuntu:~/swift-0.8/examples/swift$
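
To make the stall easier to spot in a long debug log, here is a rough
sketch (an assumption on my side, not a Swift tool) that records the
last "setting status to ..." seen for each task and lists the tasks
that never reached Completed. The regular expression is based only on
the line format visible in the log above, including the mail client's
wrapping before "to".

```python
import re

# Matches e.g. "Task(type=JOB_SUBMISSION, identity=urn:...) setting status to Submitting",
# allowing the line break the mail client inserted before "to".
STATUS_RE = re.compile(
    r"Task\(type=(\w+), identity=([^)]+)\) setting status\s+to (\w+)"
)


def last_task_statuses(log_text):
    """Map each task identity to (task type, last status seen)."""
    statuses = {}
    for task_type, identity, status in STATUS_RE.findall(log_text):
        statuses[identity] = (task_type, status)
    return statuses


def unfinished_tasks(log_text):
    """Return the tasks whose last recorded status is not Completed."""
    return {
        identity: info
        for identity, info in last_task_statuses(log_text).items()
        if info[1] != "Completed"
    }


# A few lines copied from the log above.
SAMPLE_LOG = (
    "Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status \n"
    "to Submitting\n"
    "Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status \n"
    "to Completed\n"
    "Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) setting status \n"
    "to Submitting\n"
    "Progress:  Submitting:1\n"
)

for identity, (task_type, status) in unfinished_tasks(SAMPLE_LOG).items():
    print(identity, task_type, status)
```

On the log above this would point at the JOB_SUBMISSION task, which
matches what the repeated "Progress:  Submitting:1" lines suggest.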

--------------------------------------------------------------
[3] see attachment
-------------------------------------------------------------
[4]
tp-x001 torque # qstat -f
Job Id: 3.tp-x001.ci.uchicago.edu
     Job_Name = STDIN
     Job_Owner = jobrun at tp-x001.ci.uchicago.edu
     job_state = R
     queue = batch
     server = tp-x001.ci.uchicago.edu
     Checkpoint = u
     ctime = Wed May  6 19:22:10 2009
     Error_Path = tp-x001.ci.uchicago.edu:/dev/null
     exec_host = tp-x002/0
     Hold_Types = n
     Join_Path = n
     Keep_Files = n
     Mail_Points = n
     mtime = Wed May  6 19:22:10 2009
     Output_Path = tp-x001.ci.uchicago.edu:/dev/null
     Priority = 0
     qtime = Wed May  6 19:22:10 2009
     Rerunable = True
     Resource_List.neednodes = 1
     Resource_List.nodect = 1
     Resource_List.nodes = 1
     Shell_Path_List = /bin/sh
     substate = 40
     Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,
	PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sb
	in:/opt/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=tp-x001.ci.uchicago.edu,
	PBS_O_HOST=tp-x001.ci.uchicago.edu,
	PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54,
	PBS_O_QUEUE=batch
     euser = jobrun
     egroup = users
     hashname = 3.tp-x001.c
     queue_rank = 2
     queue_type = E
     comment = Job started on Wed May 06 at 19:22
     etime = Wed May  6 19:22:10 2009

tp-x001 torque #
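
For comparing these fields across runs, a small sketch (hypothetical
helper names, written only against the "qstat -f" layout shown above)
that turns the output for one job into a dictionary and pulls out the
path part of Output_Path and Error_Path. It assumes attribute lines
look like "Name = value" and that wrapped values continue on lines
starting with a tab, as in the Variable_List entry.

```python
def parse_qstat_full(text):
    """Parse the `qstat -f` text for a single job into a dict."""
    attrs = {}
    key = None
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            attrs["Job Id"] = line.split(":", 1)[1].strip()
            key = None
        elif line.startswith("\t") and key is not None:
            # Continuation of a long value wrapped onto the next line.
            attrs[key] += line.strip()
        elif " = " in line:
            name, value = line.split(" = ", 1)
            key = name.strip()
            attrs[key] = value.strip()
    return attrs


def path_part(host_and_path):
    """Split "host:/some/path" and return only the path."""
    _host, _sep, path = host_and_path.partition(":")
    return path


# A shortened copy of the output above.
SAMPLE = (
    "Job Id: 3.tp-x001.ci.uchicago.edu\n"
    "    Job_Name = STDIN\n"
    "    job_state = R\n"
    "    Error_Path = tp-x001.ci.uchicago.edu:/dev/null\n"
    "    Output_Path = tp-x001.ci.uchicago.edu:/dev/null\n"
    "    Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,\n"
    "\tPBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54,\n"
    "\tPBS_O_QUEUE=batch\n"
)

job = parse_qstat_full(SAMPLE)
for field in ("Output_Path", "Error_Path"):
    print(field, "->", path_part(job[field]))
```

This only reports what the queue says; whether "/dev/null" is the
intended value for jobs submitted through GT4 is still my open question.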

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 20090506
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20090506/b3f98034/attachment.ksh>

