From yizhu at cs.uchicago.edu Wed Jul 1 04:33:58 2009
From: yizhu at cs.uchicago.edu (yizhu)
Date: Wed, 01 Jul 2009 04:33:58 -0500
Subject: [Swift-devel] swift error ( gridftp problem)
Message-ID: <4A4B2D86.6050608@cs.uchicago.edu>
Hi,

I have a problem running Swift on Amazon EC2, with Swift itself on my
local computer.

The EC2 side is configured as a Globus-installed PBS cluster with one
head node and several worker nodes, sharing the /home/ directory via
NFS. I used SimpleCA to create credentials for both the head node (host
certificate) and the user (user certificate).

After getting SimpleCA working, I finally got rid of the
"Authentication Failure" when running Swift, but a new problem occurs:
it gets stuck on "Progress: Initializing site shared directory:1" and
finally fails after several retries. Afterwards, I checked the Swift
work directory and found that a new directory had been created
containing a 0-byte file "_swiftwrap".

I also tried running globus-url-copy on the client side; it failed,
with the file created at the remote site but with 0-byte size. It seems
that GridFTP can successfully create the directory and file name, but
cannot actually transfer the data.

For the firewall settings on EC2, I opened tcp/udp 2119 (gram2
gatekeeper), tcp/udp 2811 (gridftp), tcp/udp 8443 (gram4), plus ssh,
https, and http.
-Yi
[1] Swift failed
-bash-3.2$ swift -tc.file ../tc.test.data -sites.file ../sites.test.xml
first.swift
Swift 0.9 swift-r2860 cog-r2388
RunID: 20090701-0344-zn2a66ub
Progress:
Progress: Initializing site shared directory:1
Progress: Initializing site shared directory:1
Progress: Initializing site shared directory:1
Progress: Initializing site shared directory:1
Progress: Initializing site shared directory:1
Progress: Failed:1
Execution failed:
Could not initialize shared directory on ec2_basecluster
Caused by:
Reply wait timeout. (error code 4)
-bash-3.2$
[2] GridFTP transfer failed
-bash-3.2$ globus-url-copy file:////home/yizhu/firstswift/hello.txt
gsiftp://ec2-174-129-90-225.compute-1.amazonaws.com/rec_data.txt
-bash-3.2$ globus-url-copy file:////home/yizhu/firstswift/hello.txt
gsiftp://ec2-174-129-90-225.compute-1.amazonaws.com:2811/home/torqueuser/rec_data.txt
GlobusUrlCopy error: UrlCopy transfer failed. [Caused by: Server refused
performing the request. Custom message: (error code 1) [Nested
exception message: Custom message: Unexpected reply: 500-Command
failed. : globus_gridftp_server_file.c:globus_l_gfs_file_recv:1770:
500-globus_l_gfs_file_open failed.
500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
500-globus_xio_register_open failed.
500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
500-Unable to open file /home/torqueuser/home/torqueuser/rec_data.txt
500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
500-System error in open: No such file or directory
500-globus_xio: A system call failed: No such file or directory
500 End.]]
-bash-3.2$
-bash-3.2$
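The doubled path in the 500 error above suggests the server resolved the
URL path relative to the session's home directory rather than as an
absolute path. A small sketch of that resolution (an assumption about the
server's behavior, not its actual code: it strips the leading "/" and
joins the remainder onto the login directory):

```python
from pathlib import PurePosixPath

# Hypothetical model of a GridFTP server that treats every URL path as
# relative to the login directory: strip the leading "/" and join it
# onto the session's home directory.
def resolve(home: str, url_path: str) -> str:
    return str(PurePosixPath(home) / url_path.lstrip("/"))

# Reproduces the doubled path from the "500-Unable to open file" error:
print(resolve("/home/torqueuser", "/home/torqueuser/rec_data.txt"))
# /home/torqueuser/home/torqueuser/rec_data.txt
```

This would explain why the first (path-less) globus-url-copy created
rec_data.txt directly in the home directory.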
[3]-bash-3.2$ cat tc.test.data
...
...
ec2_basecluster echo /bin/echo INSTALLED
INTEL32::LINUX null
ec2_basecluster cat /bin/cat INSTALLED
INTEL32::LINUX null
ec2_basecluster ls /bin/ls INSTALLED
INTEL32::LINUX null
ec2_basecluster grep /bin/grep INSTALLED
INTEL32::LINUX null
ec2_basecluster sort /bin/sort INSTALLED
INTEL32::LINUX null
ec2_basecluster paste /bin/paste INSTALLED
INTEL32::LINUX null
ec2_basecluster wc /bin/wc INSTALLED
INTEL32::LINUX null
ec2_basecluster touch /bin/touch INSTALLED
INTEL32::LINUX null
ec2_basecluster sleep /bin/sleep INSTALLED
INTEL32::LINUX null
...
...
[4] -bash-3.2$ cat sites.test.xml
...
  <jobmanager
      url="ec2-174-129-90-225.compute-1.amazonaws.com/jobmanager-pbs"
      major="2" />
  <workdirectory>/home/torqueuser</workdirectory>
...
[5] debug version of swift run
-bash-3.2$ swift -tc.file ../tc.test.data -sites.file ../sites.test.xml
first.swift -debug
Max heap: 268435456
kmlversion is >85d4b03e-7b73-49b7-81aa-096255181491<
build version is >85d4b03e-7b73-49b7-81aa-096255181491<
Recompilation suppressed.
Stack dump:
Level 1
[iA = 0, iB = 0, bA = false, bB = false]
vdl:instanceconfig = Swift configuration []
vdl:operation = run
vds.home = /home/yizhu/swift-0.9/bin/..
Using sites file: ../sites.test.xml
Using tc.data: ../tc.test.data
Setting resources to: {ec2_basecluster=ec2_basecluster}
Swift 0.9 swift-r2860 cog-r2388
Swift 0.9 swift-r2860 cog-r2388
RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090701-0348-vrb1yxl6
RunID: 20090701-0348-vrb1yxl6
closed org.griphyn.vdl.mapping.RootDataNode identifier
tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000001
type string value=Hello, world! dataset=unnamed SwiftScript value (closed)
ROOTPATH
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000001
path=$
VALUE
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000001
VALUE=Hello, world!
NEW
id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000001
Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier
tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000002
type messagefile with no value at dataset=outfile (not closed).$
NEW
id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000002
Progress:
PROCEDURE line=3 thread=0 name=greeting
PARAM thread=0 direction=output variable=t
provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000002
closed org.griphyn.vdl.mapping.RootDataNode identifier
tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000003
type string value=hello.txt dataset=unnamed SwiftScript value (closed)
ROOTPATH
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000003
path=$
VALUE
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000003
VALUE=hello.txt
START thread=0 tr=echo
Sorted: [ec2_basecluster:0.000(1.000):0/1 overload: 0]
Rand: 0.8176156212454151, sum: 1.0
Next contact: ec2_basecluster:0.000(1.000):0/1 overload: 0
START host=ec2_basecluster - Initializing shared directory
multiplyScore(ec2_basecluster:0.000(1.000):1/1 overload: 0, -0.01)
Old score: 0.000, new score: -0.010
No global submit throttle set. Using default (100)
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) setting status
to Submitting
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) setting status
to Submitted
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) setting status
to Active
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) setting status
to Completed
multiplyScore(ec2_basecluster:-0.010(0.994):1/1 overload: 0, 0.01)
Old score: -0.010, new score: 0.000
multiplyScore(ec2_basecluster:0.000(1.000):1/1 overload: 0, 0.1)
Old score: 0.000, new score: 0.100
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) Completed.
Waiting: 0, Running: 0. Heap size: 64M, Heap free: 30M, Max heap: 256M
multiplyScore(ec2_basecluster:0.100(1.060):1/1 overload: 0, -0.2)
Old score: 0.100, new score: -0.100
Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105285) setting status
to Submitting
Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105285) setting status
to Submitted
Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105285) setting status
to Active
Progress: Initializing site shared directory:1
Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105285) setting status
to Failed null
multiplyScore(ec2_basecluster:-0.100(0.943):1/1 overload: 0, -0.5)
Old score: -0.100, new score: -0.600
Releasing contact 2
commitDelayedScore(ec2_basecluster:-0.600(0.705):0/1 overload: 0, 0.1
Sorted: [ec2_basecluster:-0.500(0.747):0/1 overload: 0]
Rand: 0.4103224563240889, sum: 1.0
Next contact: ec2_basecluster:-0.500(0.747):0/1 overload: 0
Progress: Initializing site shared directory:1
START host=ec2_basecluster - Initializing shared directory
multiplyScore(ec2_basecluster:-0.500(0.747):1/1 overload: -140, -0.01)
Old score: -0.500, new score: -0.510
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) setting status
to Submitting
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) setting status
to Submitted
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) setting status
to Active
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) setting status
to Completed
multiplyScore(ec2_basecluster:-0.510(0.742):1/1 overload: 0, 0.01)
Old score: -0.510, new score: -0.500
multiplyScore(ec2_basecluster:-0.500(0.747):1/1 overload: 0, 0.1)
Old score: -0.500, new score: -0.400
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) Completed.
Waiting: 0, Running: 0. Heap size: 64M, Heap free: 28M, Max heap: 256M
multiplyScore(ec2_basecluster:-0.400(0.791):1/1 overload: 0, -0.2)
Old score: -0.400, new score: -0.600
Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105291) setting status
to Submitting
Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105291) setting status
to Submitted
Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105291) setting status
to Active
Progress: Initializing site shared directory:1
Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105291) setting status
to Failed null
multiplyScore(ec2_basecluster:-0.600(0.705):1/1 overload: 0, -0.5)
Old score: -0.600, new score: -1.100
Releasing contact 3
commitDelayedScore(ec2_basecluster:-1.100(0.530):0/1 overload: 0, 0.1
Sorted: [ec2_basecluster:-1.000(0.561):0/1 overload: 0]
Rand: 0.653323366777857, sum: 1.0
Next contact: ec2_basecluster:-1.000(0.561):0/1 overload: 0
Progress: Initializing site shared directory:1
START host=ec2_basecluster - Initializing shared directory
multiplyScore(ec2_basecluster:-1.000(0.561):1/1 overload: -199, -0.01)
Old score: -1.000, new score: -1.010
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) setting status
to Submitting
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) setting status
to Submitted
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) setting status
to Active
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) setting status
to Completed
multiplyScore(ec2_basecluster:-1.010(0.557):1/1 overload: 0, 0.01)
Old score: -1.010, new score: -1.000
multiplyScore(ec2_basecluster:-1.000(0.561):1/1 overload: 0, 0.1)
Old score: -1.000, new score: -0.900
Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) Completed.
Waiting: 0, Running: 0. Heap size: 64M, Heap free: 27M, Max heap: 256M
multiplyScore(ec2_basecluster:-0.900(0.593):1/1 overload: 0, -0.2)
Old score: -0.900, new score: -1.100
Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105297) setting status
to Submitting
Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105297) setting status
to Submitted
Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105297) setting status
to Active
Progress: Initializing site shared directory:1
Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105297) setting status
to Failed null
multiplyScore(ec2_basecluster:-1.100(0.530):1/1 overload: 0, -0.5)
Old score: -1.100, new score: -1.600
Releasing contact 4
commitDelayedScore(ec2_basecluster:-1.600(0.403):0/1 overload: 0, 0.1
END_FAILURE thread=0 tr=echo
Progress: Failed:1
Could not initialize shared directory on ec2_basecluster
Could not initialize shared directory on ec2_basecluster
Caused by: null
Caused by:
org.globus.cog.abstraction.impl.file.IrrecoverableResourceException
Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout.
(error code 4)
at
org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29)
at
org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45)
at
org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
at
org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
at
org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
at
org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
at
org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
at
org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
at
org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
at
org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
at
org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
at
org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
at
org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
at
org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
Caused by: null
Caused by:
org.globus.cog.abstraction.impl.file.IrrecoverableResourceException
Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout.
(error code 4)
at
org.globus.cog.karajan.workflow.events.FailureNotificationEvent.&lt;init&gt;(FailureNotificationEvent.java:36)
at
org.globus.cog.karajan.workflow.events.FailureNotificationEvent.&lt;init&gt;(FailureNotificationEvent.java:42)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:151)
at
org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.taskFailed(AbstractGridNode.java:314)
at
org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276)
at
org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168)
at
org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:656)
at
org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:421)
at
org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410)
at
org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
at
org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
at
org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54)
at
org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.transferFailed(DelegatedFileTransferHandler.java:581)
at
org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:505)
at java.lang.Thread.run(Thread.java:595)
Caused by:
org.globus.cog.abstraction.impl.file.IrrecoverableResourceException
at
org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:44)
at
org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:33)
at
org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:430)
at
org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:355)
at
org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47)
at
org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:492)
... 1 more
Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout.
(error code 4)
at
org.globus.ftp.vanilla.FTPServerFacade$LocalControlChannel.waitFor(FTPServerFacade.java:511)
at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:129)
... 1 more
Execution failed:
Could not initialize shared directory on ec2_basecluster
Caused by:
Reply wait timeout. (error code 4)
Detailed exception:
Could not initialize shared directory on ec2_basecluster
Caused by: null
Caused by:
org.globus.cog.abstraction.impl.file.IrrecoverableResourceException
Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout.
(error code 4)
at
org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29)
at
org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45)
at
org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
at
org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
at
org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
at
org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
at
org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
at
org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
at
org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
at
org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
at
org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
at
org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
at
org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
at
org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
Caused by: null
Caused by:
org.globus.cog.abstraction.impl.file.IrrecoverableResourceException
Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout.
(error code 4)
at
org.globus.cog.karajan.workflow.events.FailureNotificationEvent.&lt;init&gt;(FailureNotificationEvent.java:36)
at
org.globus.cog.karajan.workflow.events.FailureNotificationEvent.&lt;init&gt;(FailureNotificationEvent.java:42)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:151)
at
org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.taskFailed(AbstractGridNode.java:314)
at
org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276)
at
org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168)
at
org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:656)
at
org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:421)
at
org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410)
at
org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
at
org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
at
org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54)
at
org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.transferFailed(DelegatedFileTransferHandler.java:581)
at
org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:505)
at java.lang.Thread.run(Thread.java:595)
Caused by:
org.globus.cog.abstraction.impl.file.IrrecoverableResourceException
at
org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:44)
at
org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:33)
at
org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:430)
at
org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:355)
at
org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47)
at
org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:492)
... 1 more
Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout.
(error code 4)
at
org.globus.ftp.vanilla.FTPServerFacade$LocalControlChannel.waitFor(FTPServerFacade.java:511)
at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:129)
... 1 more
Swift finished with errors
-bash-3.2$
-bash-3.2$
From wilde at mcs.anl.gov Wed Jul 1 07:00:07 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 01 Jul 2009 07:00:07 -0500
Subject: [Swift-devel] swift error ( gridftp problem)
In-Reply-To: <4A4B2D86.6050608@cs.uchicago.edu>
References: <4A4B2D86.6050608@cs.uchicago.edu>
Message-ID: <4A4B4FC7.5090607@mcs.anl.gov>
Yi, I don't have an answer for you, but it certainly seems to be a
problem at the GridFTP level, not a Swift problem.

Do you have GLOBUS_TCP_PORT_RANGE and GLOBUS_TCP_SOURCE_RANGE set in
your client environment (i.e., on the "local computer")?

From that local computer, with an ordinary (e.g., DOEGrids or NCSA)
certificate, can you access files on, for example, TeraPort?
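If those are unset, GridFTP opens its data channels on arbitrary
ephemeral ports, which the EC2 firewall would block. That would match
your symptom: the control channel on 2811 works (directories and 0-byte
files get created), but the data transfer itself times out. A sketch of
the kind of setup I mean (the 50000-51000 range is just an example,
not a required value):

```shell
# On the EC2 server side: pin GridFTP data channels to a known range
# (example range; pick anything your firewall can open).
export GLOBUS_TCP_PORT_RANGE=50000,51000

# On the client side, if the client is also behind a firewall/NAT:
export GLOBUS_TCP_SOURCE_RANGE=50000,51000

# Then open tcp 50000-51000 inbound in the EC2 security group,
# in addition to 2811 for the control channel.
```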
- Mike
On 7/1/09 4:33 AM, yizhu wrote:
> Hi,
>
> I have a problem running Swift on Amazon EC2, with Swift itself on my
> local computer.
>
> The EC2 side is configured as a Globus-installed PBS cluster with one
> head node and several worker nodes, sharing the /home/ directory via
> NFS. I used SimpleCA to create credentials for both the head node
> (host certificate) and the user (user certificate).
>
> After getting SimpleCA working, I finally got rid of the
> "Authentication Failure" when running Swift, but a new problem occurs:
> it gets stuck on "Progress: Initializing site shared directory:1" and
> finally fails after several retries. Afterwards, I checked the Swift
> work directory and found that a new directory had been created
> containing a 0-byte file "_swiftwrap".
>
> I also tried running globus-url-copy on the client side; it failed,
> with the file created at the remote site but with 0-byte size. It
> seems that GridFTP can successfully create the directory and file
> name, but cannot actually transfer the data.
>
> For the firewall settings on EC2, I opened tcp/udp 2119 (gram2
> gatekeeper), tcp/udp 2811 (gridftp), tcp/udp 8443 (gram4), plus ssh,
> https, and http.
>
>
> -Yi
> [...]
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
> at
> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>
> at
> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
> Caused by: null
> Caused by:
> org.globus.cog.abstraction.impl.file.IrrecoverableResourceException
> Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout.
> (error code 4)
> at
> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.<init>(FailureNotificationEvent.java:36)
>
> at
> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.<init>(FailureNotificationEvent.java:42)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:151)
>
> at
> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.taskFailed(AbstractGridNode.java:314)
>
> at
> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276)
>
> at
> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168)
>
> at
> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:656)
>
> at
> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:421)
>
> at
> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410)
>
> at
> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
>
> at
> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
>
> at
> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54)
>
> at
> org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.transferFailed(DelegatedFileTransferHandler.java:581)
>
> at
> org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:505)
>
> at java.lang.Thread.run(Thread.java:595)
> Caused by:
> org.globus.cog.abstraction.impl.file.IrrecoverableResourceException
> at
> org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:44)
>
> at
> org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:33)
>
> at
> org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:430)
>
> at
> org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:355)
>
> at
> org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47)
>
> at
> org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:492)
>
> ... 1 more
> Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout.
> (error code 4)
> at
> org.globus.ftp.vanilla.FTPServerFacade$LocalControlChannel.waitFor(FTPServerFacade.java:511)
>
> at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:129)
> ... 1 more
> Execution failed:
> Could not initialize shared directory on ec2_basecluster
> Caused by:
> Reply wait timeout. (error code 4)
> Detailed exception:
> Could not initialize shared directory on ec2_basecluster
> Caused by: null
> Caused by:
> org.globus.cog.abstraction.impl.file.IrrecoverableResourceException
> Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout.
> (error code 4)
> at
> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29)
>
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45)
>
> at
> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
> at
> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>
> at
> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
> Caused by: null
> Caused by:
> org.globus.cog.abstraction.impl.file.IrrecoverableResourceException
> Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout.
> (error code 4)
> at
> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.<init>(FailureNotificationEvent.java:36)
>
> at
> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.<init>(FailureNotificationEvent.java:42)
>
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:151)
>
> at
> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.taskFailed(AbstractGridNode.java:314)
>
> at
> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276)
>
> at
> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168)
>
> at
> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:656)
>
> at
> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:421)
>
> at
> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410)
>
> at
> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
>
> at
> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
>
> at
> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54)
>
> at
> org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.transferFailed(DelegatedFileTransferHandler.java:581)
>
> at
> org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:505)
>
> at java.lang.Thread.run(Thread.java:595)
> Caused by:
> org.globus.cog.abstraction.impl.file.IrrecoverableResourceException
> at
> org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:44)
>
> at
> org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:33)
>
> at
> org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:430)
>
> at
> org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:355)
>
> at
> org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47)
>
> at
> org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:492)
>
> ... 1 more
> Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout.
> (error code 4)
> at
> org.globus.ftp.vanilla.FTPServerFacade$LocalControlChannel.waitFor(FTPServerFacade.java:511)
>
> at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:129)
> ... 1 more
> Swift finished with errors
> -bash-3.2$
> -bash-3.2$
>
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From benc at hawaga.org.uk Wed Jul 1 07:26:11 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 1 Jul 2009 12:26:11 +0000 (GMT)
Subject: [Swift-devel] swift error ( gridftp problem)
In-Reply-To: <4A4B2D86.6050608@cs.uchicago.edu>
References: <4A4B2D86.6050608@cs.uchicago.edu>
Message-ID:
This is almost certainly a firewall problem: you do not have the
correct ports open for GridFTP data channels.
read this:
http://dev.globus.org/wiki/FirewallHowTo
You need to configure an ephemeral port range in your firewall, of maybe
1000 ports, and declare it via the GLOBUS_TCP_PORT_RANGE environment variable for your server,
as described here:
http://dev.globus.org/wiki/FirewallHowTo#Configuring_GridFTP_to_use_GLOBUS_TCP_PORT_RANGE
Make sure you can transfer a file with globus-url-copy before attempting
to run Swift.
This is not a swift-specific problem.
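A hedged sketch of the server-side steps described above (the 50000-51000 range is only an example; the hostname is the one from the original report, and exactly where you export the variable depends on how your GridFTP server is started):

```shell
# 1. Open an ephemeral data-channel range in the host firewall
#    (example range; on EC2, also open it in the instance's security group).
iptables -I INPUT -p tcp --dport 50000:51000 -j ACCEPT

# 2. Restrict the GridFTP server's data channels to that range. This must be
#    set in the environment of the server process, e.g. in its init/xinetd config.
export GLOBUS_TCP_PORT_RANGE=50000,51000

# 3. Re-test a plain transfer before running Swift again.
globus-url-copy file:///etc/hostname \
    gsiftp://ec2-174-129-90-225.compute-1.amazonaws.com/rec_data.txt
```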
--
From benc at hawaga.org.uk Wed Jul 1 10:37:28 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 1 Jul 2009 15:37:28 +0000 (GMT)
Subject: [Swift-devel] writeData
Message-ID:
r2994 contains a writeData function which does the opposite of readData.
Specifically, you can say:
file l;
l = writeData(@f);
to output the filenames for a data structure into a text file, so that you
can pass this instead of passing filenames on the command line.
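For illustration, a slightly fuller sketch of the above (the mapper, prefix, and file names here are hypothetical, not from the commit):

```
type file;

// Hypothetical mapped array of files
file f[] <simple_mapper; prefix="out">;

// Write the member filenames of f into filelist.txt, so a program can
// read the list from a file instead of taking it on the command line.
file l <"filelist.txt">;
l = writeData(@f);
```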
--
From yizhu at cs.uchicago.edu Wed Jul 1 15:42:31 2009
From: yizhu at cs.uchicago.edu (yizhu)
Date: Wed, 01 Jul 2009 15:42:31 -0500
Subject: [Swift-devel] swift error ( gridftp problem)
In-Reply-To:
References: <4A4B2D86.6050608@cs.uchicago.edu>
Message-ID: <4A4BCA37.8070104@cs.uchicago.edu>
Yup, it was my firewall settings problem; it works now. Thanks.
-Yi
Ben Clifford wrote:
> this is almost definitely a firewall problem, with you not having the
> correct ports for gridftp data channels open.
>
> read this:
>
> http://dev.globus.org/wiki/FirewallHowTo
>
> You need to configure an ephemeral port range in your firewall, of maybe
> 1000 ports, and declare it in the GLOBUS_TCP_PORT_RANGE for your server,
> as described here:
>
> http://dev.globus.org/wiki/FirewallHowTo#Configuring_GridFTP_to_use_GLOBUS_TCP_PORT_RANGE
>
> Make sure you can transfer a file with globus-url-copy before attempting
> to run Swift.
>
> This is not a swift-specific problem.
>
From rynge at renci.org Thu Jul 2 11:19:02 2009
From: rynge at renci.org (Mats Rynge)
Date: Thu, 02 Jul 2009 12:19:02 -0400
Subject: [Swift-devel] Patch for swift-osg-ress-site-catalog
Message-ID: <4A4CDDF6.9030101@renci.org>
Swift developers,
Attached is a patch for the swift-osg-ress-site-catalog tool, with a fix
for sites having multiple gatekeepers advertised under the same site name.
--
Mats Rynge
Renaissance Computing Institute
-------------- next part --------------
A non-text attachment was scrubbed...
Name: swift-osg-ress-site-catalog.patch
Type: text/x-diff
Size: 1804 bytes
Desc: not available
URL:
From aespinosa at cs.uchicago.edu Thu Jul 2 14:32:30 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 2 Jul 2009 14:32:30 -0500
Subject: [Swift-devel] workers not initiated on all nodes/cpus in a block
Message-ID: <50b07b4b0907021232w42a186e3yea94e4432e154506@mail.gmail.com>
Looking at the submit script below: even though the coaster block
requested 8 nodes, it still runs only 1 worker.
submit script found:
cat PBS2252235058660926788.submit
#PBS -S /bin/sh
#PBS -N null
#PBS -m n
#PBS -l nodes=8
#PBS -l walltime=00:04:00
#PBS -q short
#PBS -o /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stdout
#PBS -e /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stderr
/usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
http://128.135.125.116:47679 0702-050234-000004 1
/bin/echo $? >/home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.exitcode
The /usr/bin/perl line should be prefixed with "pbsdsh" or an
equivalent utility so the script is executed on all nodes/CPUs. I think
this is why, in some instances, the block requests more nodes
but not all of them become active.
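For comparison, a hedged sketch of what the corrected launch line might look like (pbsdsh is the standard PBS/Torque utility for running a command on every CPU of the allocation; the script path and arguments are the ones from the generated submit file above):

```shell
# Launch the coaster worker once per allocated CPU/node instead of only
# on the first node of the reservation.
pbsdsh /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl \
    http://128.135.125.116:47679 0702-050234-000004 1
```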
host information:
[aespinosa at communicado ~]$ screen -r
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
Reservation '1122120' (-00:05:07 -> 00:22:53 Duration: 00:28:00)
PE: 8.00 StartPriority: 1800
[aespinosa at tp-c105 scripts]$ ssh tp-c114 ps x
Password:
PID TTY STAT TIME COMMAND
31815 ? Ss 0:00 -sh
32054 ? S 0:00 pbs_demux
32229 ? S 0:00 -sh
32230 ? S 0:00 /usr/bin/perl
/home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
http://128.135.125.116:47679 0702-050234-000003 1
32231 ? S 0:00 /usr/bin/perl
/home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
http://128.135.125.116:47679 0702-050234-000003 1
32233 ? S 0:00 /bin/bash
/home/aespinosa/work/ampl/ampl-teraport_coaster/shared/_swiftwrap
run_ampl-slru44dj -jobdir s -e /home/zzhang/SEE/static/run_ampl -out
result/run1416/stdout -err stderr.txt -i -d
|subproblems|result/run1416 -if
template|armington.mod|armington_process.cmd|armington_output.cmd|subproblems/producer_tree.mod|ces.so
-of result/run1416/expend.dat|result/run1416/limits.dat|result/run1416/price.dat|result/run1416/ratio.dat|result/run1416/solve.dat|result/run1416/stdout
-k -status files -a run1416 template armington.mod
armington_process.cmd armington_output.cmd
subproblems/producer_tree.mod ces.so
32256 ? S 0:00 /bin/bash /home/zzhang/SEE/static/run_ampl
run1416 template armington.mod armington_process.cmd
armington_output.cmd subproblems/producer_tree.mod ces.so
32258 ? S 0:19 ampl arm_test.cmd
32716 ? R 0:37 pathampl /tmp/at32258 -AMPL
32726 ? S 0:00 sshd: aespinosa at notty
32727 ? Rs 0:00 ps x
[aespinosa at tp-c105 scripts]$ ssh tp-c105 ps x
Password:
PID TTY STAT TIME COMMAND
30721 ? S 0:00 sshd: aespinosa at pts/0
30722 pts/0 Ss 0:00 -bash
30951 pts/0 S+ 0:00 ssh tp-c105 ps x
30955 ? S 0:00 sshd: aespinosa at notty
30956 ? Rs 0:00 ps x
[aespinosa at tp-c105 scripts]$ ssh tp-c102 ps x
The authenticity of host 'tp-c102 (10.135.125.108)' can't be established.
RSA key fingerprint is 60:dc:28:eb:f3:1b:ca:80:48:f2:32:f5:1e:3b:b3:d7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'tp-c102,10.135.125.108' (RSA) to the list
of known hosts.
Password:
PID TTY STAT TIME COMMAND
10274 ? S 0:00 sshd: aespinosa at notty
10275 ? Rs 0:00 ps x
...
...
swift session snapshot:
Progress: Selecting site:1014 Submitted:8 Active:1
Progress: Selecting site:1014 Submitted:8 Active:1
Progress: Selecting site:1014 Submitted:8 Active:1
Progress: Selecting site:1014 Submitted:8 Active:1
queue information:
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
1122120 aespinosa Running 8 00:19:53 Thu Jul 2 14:22:19
1 Active Job 171 of 200 Processors Active (85.50%)
100 of 100 Nodes Active (100.00%)
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From hategan at mcs.anl.gov Thu Jul 2 14:39:03 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 02 Jul 2009 14:39:03 -0500
Subject: [Swift-devel] workers not initiated on all nodes/cpus in a block
In-Reply-To: <50b07b4b0907021232w42a186e3yea94e4432e154506@mail.gmail.com>
References: <50b07b4b0907021232w42a186e3yea94e4432e154506@mail.gmail.com>
Message-ID: <1246563543.4778.0.camel@localhost>
This is with the PBS provider rather than Globus, right?
On Thu, 2009-07-02 at 14:32 -0500, Allan Espinosa wrote:
> looking at the submit script before, even though the coaster block
> requested for 8 nodes, it still simply runs 1 worker
>
> submit script found:
> cat PBS2252235058660926788.submit
> #PBS -S /bin/sh
> #PBS -N null
> #PBS -m n
> #PBS -l nodes=8
> #PBS -l walltime=00:04:00
> #PBS -q short
> #PBS -o /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stdout
> #PBS -e /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stderr
> /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
> http://128.135.125.116:47679 0702-050234-000004 1
> /bin/echo $? >/home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.exitcode
>
>
> the /usr/bin/perl line should be prepended with "pbdsh" or other
> equivalent utilities to execute the script on all nodes/cpus. i think
> this is the reason why in some instances the block requests more nodes
> but not all are active.
>
> host information:
> [aespinosa at communicado ~]$ screen -r
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 1
> PartitionMask: [ALL]
> Flags: RESTARTABLE
>
> Reservation '1122120' (-00:05:07 -> 00:22:53 Duration: 00:28:00)
> PE: 8.00 StartPriority: 1800
>
> [aespinosa at tp-c105 scripts]$ ssh tp-c114 ps x
> Password:
> PID TTY STAT TIME COMMAND
> 31815 ? Ss 0:00 -sh
> 32054 ? S 0:00 pbs_demux
> 32229 ? S 0:00 -sh
> 32230 ? S 0:00 /usr/bin/perl
> /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
> http://128.135.125.116:47679 0702-050234-000003 1
> 32231 ? S 0:00 /usr/bin/perl
> /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
> http://128.135.125.116:47679 0702-050234-000003 1
> 32233 ? S 0:00 /bin/bash
> /home/aespinosa/work/ampl/ampl-teraport_coaster/shared/_swiftwrap
> run_ampl-slru44dj -jobdir s -e /home/zzhang/SEE/static/run_ampl -out
> result/run1416/stdout -err stderr.txt -i -d
> |subproblems|result/run1416 -if
> template|armington.mod|armington_process.cmd|armington_output.cmd|subproblems/producer_tree.mod|ces.so
> -of result/run1416/expend.dat|result/run1416/limits.dat|result/run1416/price.dat|result/run1416/ratio.dat|result/run1416/solve.dat|result/run1416/stdout
> -k -status files -a run1416 template armington.mod
> armington_process.cmd armington_output.cmd
> subproblems/producer_tree.mod ces.so
> 32256 ? S 0:00 /bin/bash /home/zzhang/SEE/static/run_ampl
> run1416 template armington.mod armington_process.cmd
> armington_output.cmd subproblems/producer_tree.mod ces.so
> 32258 ? S 0:19 ampl arm_test.cmd
> 32716 ? R 0:37 pathampl /tmp/at32258 -AMPL
> 32726 ? S 0:00 sshd: aespinosa at notty
> 32727 ? Rs 0:00 ps x
> [aespinosa at tp-c105 scripts]$ ssh tp-c105 ps x
> Password:
> PID TTY STAT TIME COMMAND
> 30721 ? S 0:00 sshd: aespinosa at pts/0
> 30722 pts/0 Ss 0:00 -bash
> 30951 pts/0 S+ 0:00 ssh tp-c105 ps x
> 30955 ? S 0:00 sshd: aespinosa at notty
> 30956 ? Rs 0:00 ps x
> [aespinosa at tp-c105 scripts]$ ssh tp-c102 ps x
> The authenticity of host 'tp-c102 (10.135.125.108)' can't be established.
> RSA key fingerprint is 60:dc:28:eb:f3:1b:ca:80:48:f2:32:f5:1e:3b:b3:d7.
> Are you sure you want to continue connecting (yes/no)? yes
> Warning: Permanently added 'tp-c102,10.135.125.108' (RSA) to the list
> of known hosts.
> Password:
> PID TTY STAT TIME COMMAND
> 10274 ? S 0:00 sshd: aespinosa at notty
> 10275 ? Rs 0:00 ps x
> ...
> ...
>
>
> swift session snapshot:
> Progress: Selecting site:1014 Submitted:8 Active:1
> Progress: Selecting site:1014 Submitted:8 Active:1
> Progress: Selecting site:1014 Submitted:8 Active:1
> Progress: Selecting site:1014 Submitted:8 Active:1
>
> queue information:
> ACTIVE JOBS--------------------
> JOBNAME USERNAME STATE PROC REMAINING STARTTIME
>
> 1122120 aespinosa Running 8 00:19:53 Thu Jul 2 14:22:19
>
> 1 Active Job 171 of 200 Processors Active (85.50%)
> 100 of 100 Nodes Active (100.00%)
>
>
>
>
>
> --
> Allan M. Espinosa
> PhD student, Computer Science
> University of Chicago
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From aespinosa at cs.uchicago.edu Thu Jul 2 14:42:25 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 2 Jul 2009 14:42:25 -0500
Subject: [Swift-devel] workers not initiated on all nodes/cpus in a block
In-Reply-To: <1246563543.4778.0.camel@localhost>
References: <50b07b4b0907021232w42a186e3yea94e4432e154506@mail.gmail.com>
<1246563543.4778.0.camel@localhost>
Message-ID: <50b07b4b0907021242y609c8a5wc09a8707f9668f9@mail.gmail.com>
Yup, the PBS provider. I'll check whether the same happens with the Globus GT2 provider.
-Allan
2009/7/2 Mihael Hategan :
> This is with the PBS provider rather than Globus, right?
>
> On Thu, 2009-07-02 at 14:32 -0500, Allan Espinosa wrote:
>> looking at the submit script before, even though the coaster block
>> requested for 8 nodes, it still simply runs 1 worker
>>
>> submit script found:
>>  cat PBS2252235058660926788.submit
>> #PBS -S /bin/sh
>> #PBS -N null
>> #PBS -m n
>> #PBS -l nodes=8
>> #PBS -l walltime=00:04:00
>> #PBS -q short
>> #PBS -o /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stdout
>> #PBS -e /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stderr
>> /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
>> http://128.135.125.116:47679 0702-050234-000004 1
>> /bin/echo $? >/home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.exitcode
>>
>>
>> the /usr/bin/perl line should be prepended with "pbdsh" or other
>> equivalent utilities to execute the script on all nodes/cpus. i think
>> this is the reason why in some instances the block requests more nodes
>> but not all are active.
>>
>> host information:
>> [aespinosa at communicado ~]$ screen -r
>> IWD: [NONE]  Executable:  [NONE]
>> Bypass: 0  StartCount: 1
>> PartitionMask: [ALL]
>> Flags:       RESTARTABLE
>>
>> Reservation '1122120' (-00:05:07 -> 00:22:53  Duration: 00:28:00)
>> PE:  8.00  StartPriority:  1800
>>
>> [aespinosa at tp-c105 scripts]$ ssh tp-c114 ps x
>> Password:
>>   PID TTY      STAT   TIME COMMAND
>> 31815 ?        Ss     0:00 -sh
>> 32054 ?        S      0:00 pbs_demux
>> 32229 ?        S      0:00 -sh
>> 32230 ?        S      0:00 /usr/bin/perl
>> /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
>> http://128.135.125.116:47679 0702-050234-000003 1
>> 32231 ?        S      0:00 /usr/bin/perl
>> /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
>> http://128.135.125.116:47679 0702-050234-000003 1
>> 32233 ?        S      0:00 /bin/bash
>> /home/aespinosa/work/ampl/ampl-teraport_coaster/shared/_swiftwrap
>> run_ampl-slru44dj -jobdir s -e /home/zzhang/SEE/static/run_ampl -out
>> result/run1416/stdout -err stderr.txt -i -d
>> |subproblems|result/run1416 -if
>> template|armington.mod|armington_process.cmd|armington_output.cmd|subproblems/producer_tree.mod|ces.so
>> -of result/run1416/expend.dat|result/run1416/limits.dat|result/run1416/price.dat|result/run1416/ratio.dat|result/run1416/solve.dat|result/run1416/stdout
>> -k  -status files -a run1416 template armington.mod
>> armington_process.cmd armington_output.cmd
>> subproblems/producer_tree.mod ces.so
>> 32256 ?        S      0:00 /bin/bash /home/zzhang/SEE/static/run_ampl
>> run1416 template armington.mod armington_process.cmd
>> armington_output.cmd subproblems/producer_tree.mod ces.so
>> 32258 ?        S      0:19 ampl arm_test.cmd
>> 32716 ?        R      0:37 pathampl /tmp/at32258 -AMPL
>> 32726 ?        S      0:00 sshd: aespinosa at notty
>> 32727 ?        Rs     0:00 ps x
>> [aespinosa at tp-c105 scripts]$ ssh tp-c105 ps x
>> Password:
>>   PID TTY      STAT   TIME COMMAND
>> 30721 ?        S      0:00 sshd: aespinosa at pts/0
>> 30722 pts/0    Ss     0:00 -bash
>> 30951 pts/0    S+     0:00 ssh tp-c105 ps x
>> 30955 ?        S      0:00 sshd: aespinosa at notty
>> 30956 ?        Rs     0:00 ps x
>> [aespinosa at tp-c105 scripts]$ ssh tp-c102 ps x
>> The authenticity of host 'tp-c102 (10.135.125.108)' can't be established.
>> RSA key fingerprint is 60:dc:28:eb:f3:1b:ca:80:48:f2:32:f5:1e:3b:b3:d7.
>> Are you sure you want to continue connecting (yes/no)? yes
>> Warning: Permanently added 'tp-c102,10.135.125.108' (RSA) to the list
>> of known hosts.
>> Password:
>>   PID TTY      STAT   TIME COMMAND
>> 10274 ?        S      0:00 sshd: aespinosa at notty
>> 10275 ?        Rs     0:00 ps x
>> ...
>> ...
>>
>>
>> swift session snapshot:
>> Progress:  Selecting site:1014  Submitted:8  Active:1
>> Progress:  Selecting site:1014  Submitted:8  Active:1
>> Progress:  Selecting site:1014  Submitted:8  Active:1
>> Progress:  Selecting site:1014  Submitted:8  Active:1
>>
>> queue information:
>> ACTIVE JOBS--------------------
>> JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
>>
>> 1122120            aespinosa    Running     8    00:19:53  Thu Jul  2 14:22:19
>>
>>      1 Active Job      171 of  200 Processors Active (85.50%)
>>                        100 of  100 Nodes Active      (100.00%)
From benc at hawaga.org.uk Fri Jul 3 04:28:01 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 3 Jul 2009 09:28:01 +0000 (GMT)
Subject: [Swift-devel] Patch for swift-osg-ress-site-catalog
In-Reply-To: <4A4CDDF6.9030101@renci.org>
References: <4A4CDDF6.9030101@renci.org>
Message-ID:
applied r2995
--
From benc at hawaga.org.uk Fri Jul 3 12:35:41 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 3 Jul 2009 17:35:41 +0000 (GMT)
Subject: [Swift-devel] imports
Message-ID:
swift r2996 contains an import directive which will import SwiftScript
code from other .swift files into the current program.
This is done deep in the compiler, and is not a preprocessor.
You can import the same file multiple times without trouble; it will only
be processed once.
At present you can only import files that are in the current working
directory. $PATH/$CLASSPATH/$PERL5LIB-style path handling should be
straightforward to implement, though.
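For illustration, a minimal sketch of how this might look (the file name is hypothetical, and the exact form of the directive should be checked against r2996):

```
// imports defs.swift from the current working directory
import defs;

// importing the same file again is harmless; it is processed only once
import defs;
```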
--
From bugzilla-daemon at mcs.anl.gov Wed Jul 8 09:37:39 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 8 Jul 2009 09:37:39 -0500 (CDT)
Subject: [Swift-devel] [Bug 214] New: Enhance logging and debug capabilities
for Condor provider
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=214
Summary: Enhance logging and debug capabilities for Condor
provider
Product: Swift
Version: unspecified
Platform: All
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov
- create all condor submit files with a log file entry
- add a setting to not delete condor files in .globus/scripts after they
complete,
for debugging
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From wilde at mcs.anl.gov Wed Jul 8 09:38:23 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 08 Jul 2009 09:38:23 -0500
Subject: [Swift-devel] Re: [CI Ticketing System #1226] Condor hung on
communicado
In-Reply-To:
References: <4A4E05A7.8030503@mcs.anl.gov>
<4A54A54D.1010006@mcs.anl.gov>
Message-ID: <4A54AF5F.5050200@mcs.anl.gov>
done.
On 7/8/09 9:20 AM, Ben Clifford wrote:
> On Wed, 8 Jul 2009, Michael Wilde wrote:
>
>> - create all condor submit files with a log file entry
>> - a setting to not delete condor files in .globus/scripts after they complete,
>> for debugging
>
> those would be best entered as enhancement requests into the CoG bugzilla.
>
From rynge at renci.org Fri Jul 10 16:29:31 2009
From: rynge at renci.org (Mats Rynge)
Date: Fri, 10 Jul 2009 17:29:31 -0400
Subject: [Swift-devel] Condor-G jobs left in the queue upon completion/hold
Message-ID: <4A57B2BB.8030904@renci.org>
Looks like Swift is not cleaning up completed/held Condor-G jobs. There
are more than 1000 jobs in the queue on engage-submit. Some jobs are in
the Done state, some in the Held state.
--
Mats Rynge
Renaissance Computing Institute
From aespinosa at cs.uchicago.edu Fri Jul 10 16:39:48 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Fri, 10 Jul 2009 16:39:48 -0500
Subject: [Swift-devel] Re: Condor-G jobs left in the queue upon
completion/hold
In-Reply-To: <4A57B2BB.8030904@renci.org>
References: <4A57B2BB.8030904@renci.org>
Message-ID: <50b07b4b0907101439o3054aaf0w78f9f420d1bd49c9@mail.gmail.com>
Hi Mats,
Just cleaned my jobs in the queue. I did not notice this when my
jobs finished last night.
-Allan
2009/7/10 Mats Rynge :
> Looks like Swift is not cleaning up completed/held Condor-G jobs. There are
> more than 1000 jobs in the queue on engage-submit. Some jobs are in the Done
> state, some in the Held state.
From hategan at mcs.anl.gov Fri Jul 10 16:53:08 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 10 Jul 2009 16:53:08 -0500
Subject: [Swift-devel] Condor-G jobs left in the queue upon completion/hold
In-Reply-To: <4A57B2BB.8030904@renci.org>
References: <4A57B2BB.8030904@renci.org>
Message-ID: <1247262788.15261.9.camel@localhost>
Yeah. I think job logs might be a better solution to figure job state
than +leave_in_queue, check, -leave_in_queue.
On Fri, 2009-07-10 at 17:29 -0400, Mats Rynge wrote:
> Looks like Swift is not cleaning up completed/held Condor-G jobs. There
> are more than 1000 jobs in the queue on engage-submit. Some jobs are in
> the Done state, some in the Held state.
>
From benc at hawaga.org.uk Sat Jul 11 05:24:19 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 11 Jul 2009 10:24:19 +0000 (GMT)
Subject: [Swift-devel] Condor-G jobs left in the queue upon completion/hold
In-Reply-To: <1247262788.15261.9.camel@localhost>
References: <4A57B2BB.8030904@renci.org> <1247262788.15261.9.camel@localhost>
Message-ID:
On Fri, 10 Jul 2009, Mihael Hategan wrote:
> Yeah. I think job logs might be a better solution to figure job state
> than +leave_in_queue, check, -leave_in_queue.
Interestingly, Miron said the same to me only yesterday...
Alain Roy is also watching this and agrees.
--
From hategan at mcs.anl.gov Sat Jul 11 10:28:37 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 11 Jul 2009 10:28:37 -0500
Subject: [Swift-devel] Condor-G jobs left in the queue upon completion/hold
In-Reply-To:
References: <4A57B2BB.8030904@renci.org> <1247262788.15261.9.camel@localhost>
Message-ID: <1247326117.22686.1.camel@localhost>
On Sat, 2009-07-11 at 10:24 +0000, Ben Clifford wrote:
>
> On Fri, 10 Jul 2009, Mihael Hategan wrote:
>
> > Yeah. I think job logs might be a better solution to figure job state
> > than +leave_in_queue, check, -leave_in_queue.
>
> Interestingly, Miron said the same to me only yesterday...
>
> Alain Roy is also watching this and agrees.
>
Though that suffers from its own problems:
1. If a different log is used for every job, lots of files may need to
be tailed at once.
2. If a single file is used for every job, is there any guarantee that
entries in the log are written atomically?
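For the shared-log case, a rough sketch of what the reading side could look like (hedged: the parser assumes the standard Condor user-log event layout, where each event starts with a three-digit code such as 000 = submitted, 005 = terminated, 012 = held, followed by the cluster id in parentheses; the atomicity caveat above applies, since interleaved writes would corrupt these blocks):

```python
# Sketch: scan a shared Condor user log for terminal job events.
# Event codes follow the standard Condor user-log convention
# (000 = submitted, 005 = terminated, 012 = held). If writes to a
# shared log are not atomic per event, interleaved blocks would
# break this parser -- which is exactly concern #2 above.
import re

EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.\d+\.\d+\)")

def terminal_jobs(log_text):
    """Return {cluster_id: 'terminated' | 'held'} from a Condor user log."""
    states = {}
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue
        code, cluster = m.group(1), int(m.group(2))
        if code == "005":
            states[cluster] = "terminated"
        elif code == "012":
            states[cluster] = "held"
    return states
```

A cleanup poller built on something like this could condor_rm held clusters and let finished ones leave the queue, instead of the +leave_in_queue, check, -leave_in_queue round trip.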
From bugzilla-daemon at mcs.anl.gov Sun Jul 12 16:23:41 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 12 Jul 2009 16:23:41 -0500 (CDT)
Subject: [Swift-devel] [Bug 215] New: stdout and stderr redirect for SGE
jobmanager causing failure on stageouts
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=215
Summary: stdout and stderr redirect for SGE jobmanager causing
failure on stageouts
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Windows
Status: NEW
Severity: normal
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: skenny at uchicago.edu
CC: zhaozhang at uchicago.edu
Stdout and stderr have been redirected when the SGE job manager is detected;
however, this seems to cause a GRAM failure:
7/10 00:39:28 JM: sending callback of status 4 (failure code 155) to
https://128.135.92.64:50003/1247203143796.
When this redirection is commented out of the Swift code, workflows run
properly on the Ranger TeraGrid site. However, note that when the
redirection is not in place, a data.* file is created in the user's $HOME for
each job run (so if you run many thousands of jobs, you will have a file for
each one).
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
From bugzilla-daemon at mcs.anl.gov Mon Jul 13 02:20:43 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 13 Jul 2009 02:20:43 -0500 (CDT)
Subject: [Swift-devel] [Bug 216] New: poor compile error when semicolon
missing at end of structure definition
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=216
Summary: poor compile error when semicolon missing at end of
structure definition
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Documentation
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
In the code below, the error message given is unenlightening; it should refer
to something closer to the actual error. Perhaps the parser should fail as
soon as it sees a token after the } that is not SEMI.
Removing files from previous runs
Running test 07554-ext-mapper-struct at Mon Jul 13 09:18:39 CEST 2009
Could not start execution.
Compile error in procedure invocation at line 16: Type messagefile is not
defined.
type messagefile;
type struct {
messagefile eerste;
messagefile twede;
} // MISSING SEMICOLON HERE
(messagefile t) write(string s) {
app {
echo s stdout=@filename(t);
}
}
messagefile outfiles ;
outfiles.eerste = write("1st");
outfiles.twede = write("2nd");
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From bugzilla-daemon at mcs.anl.gov Mon Jul 13 02:23:22 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 13 Jul 2009 02:23:22 -0500 (CDT)
Subject: [Swift-devel] [Bug 216] poor compile error when semicolon missing
at end of structure definition
In-Reply-To:
References:
Message-ID: <20090713072322.144352CC5C@wind.mcs.anl.gov>
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=216
Ben Clifford changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |INVALID
--- Comment #1 from Ben Clifford 2009-07-13 02:23:21 ---
Actually, this bug report is incorrect.
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From wilde at mcs.anl.gov Mon Jul 13 11:45:59 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 13 Jul 2009 11:45:59 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
Message-ID: <4A5B64C7.4080802@mcs.anl.gov>
I thought I wrote an email on this, but can't find it, so I will try to
recall what I saw.
Sarah tried a test run to re-create the problem of "excessive overhead
from coasters on the head node". This was spurred by another complaint
from the Ranger sysadmins. The complaint had about the same level of
detail as the first: it was voice mail saying "your processing are
causing too much overhead on the login node".
So we tried to do a test to isolate and quantify what was happening. We
did not get far enough, but got some initial observations.
Submitting from gwynn.bsd.uchicago.edu (I think) Sarah ran a workflow of
50 sleep 300 jobs (approx).
This was around 7PM Thu night Jul 9. Sarah, are these logs still there?
Can you copy the coaster and swift logs to the CI where we can look at them?
What I saw in top (-b -d) and ps was:
- two Java processes were created on login3 (headnode) with her ID
- one was about 275MB virt mem and burning 100% CPU time, continuously
- one was about 1GB virt mem and not burning much time
- tailing the coaster log in Sarah's home directory showed repetitive
activity, seemingly about every second, a burst of "polling-like" messages
- seems like there were about 3-4 GRAM jobmanagers for the 50 jobs,
which would be good, I think (in that it seems like jobs were allocated
in blocks).
At the time we did not have a chance to gather detailed evidence, but I
was surprised by two things:
- that there were two Java processes and that one was so big. (Or was the
active process most likely just a child thread of the main process?)
- that there was continual log activity while the 50 jobs were sleeping.
But I don't have solid evidence that the 50 jobs were actually running
and sleeping.
I think if we correlate the swift log and the coaster log here we might
learn more.
I don't know if this was using Mihael's latest code with a reduced
logging level or not.
Allan, this seems like it should be straightforward to reproduce now, so
please go ahead and try to do that, and capture everything, including
ideally the profile info that Mihael was trying to explain to Zhao how
to capture.
- Mike
From hategan at mcs.anl.gov Mon Jul 13 12:04:02 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 12:04:02 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <4A5B64C7.4080802@mcs.anl.gov>
References: <4A5B64C7.4080802@mcs.anl.gov>
Message-ID: <1247504642.17460.6.camel@localhost>
On Mon, 2009-07-13 at 11:45 -0500, Michael Wilde wrote:
> I thought I wrote an email on this, but cant find it, so I will try to
> recall what I saw.
>
> Sarah tried a test run to re-create the problem of "excessive overhead
> from coasters on the head node". This was spurred by another complaint
> from the Ranger sysadmins. The complaint had about the same level of
> detail as the first: it was voice mail saying "your processing are
> causing too much overhead on the login node".
>
> So we tried to do a test to isolate and quantify what was happening. We
> did not get far enough, but got some initial observations.
>
> Submitting from gwynn.bsd.uchicago.edu (I think) Sarah ran a workflow of
> 50 sleep 300 jobs (approx).
>
> This was around 7PM Thu night Jul 9. Sarah, are these logs still there?
> Can you copy the coaster and swift logs to the CI where we can look at them?
>
> What I saw in top (-b -d) and ps was:
>
> - two Java processes were created on login3 (headnode) with her ID
> - one was about 275MB virt mem and burning 100% CPU time, continuously
> - one was about 1GB virt mem and not burning much time
> - tailing the coaster log in Sarah's home directory showed repetitive
> activity, seemingly about every second, a burst of "polling-like" messages
> - seems like there were about 3-4 GRAM jobmanagers for the 50 jobs,
> which would be good, I think (in that it seems like jobs were allocated
> in blocks).
>
> At the time we did not have a chance to gather detailed evidence, but I
> was surprised by two things:
>
> - that there were two Java processes and that one was so big. (Are most
> likely the active process was just a child thread of the main process?)
One java process is the bootstrap process (it downloads the coaster
jars, sets up the environment and runs the coaster service). It has
always been like this. Did you happen to capture the output of ps to a
file? That would be useful, because from what you are suggesting, it
appears that the bootstrap process is eating 100% CPU. That process
should only be sleeping after the service is started.
>
> - that there was continual log activity
By some very odd definition of "continual". The schedule is re-computed
periodically. The messages also tell you how much time it takes to
re-compute the schedule, which divided by the pause interval should give
you the maximum CPU usage for the process for a time period, other
things ignored. In the idle state, this takes around 1ms (0.1% CPU
usage).
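As a back-of-envelope check on that bound (the ~1 ms recompute time is from the message; the 1 s pause interval is an assumed figure for illustration):

```python
# Maximum CPU fraction attributable to periodic schedule recomputation:
# time spent per recomputation divided by the pause between recomputations.
def max_cpu_fraction(recompute_ms, interval_ms):
    return recompute_ms / interval_ms

# With ~1 ms of work per recompute and an assumed 1 s interval,
# the bound is 0.001, i.e. the 0.1% CPU figure quoted above.
```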
> while the 50 jobs were sleeping.
> But I dont have solid evidence that the 50 jobs were actually running
> and sleeping.
>
> I think if we correlate the swift log and the coaster log here we might
> learn more.
>
> I dont know if this was using Mihael's latest code with a reduced
> logging level or not.
>
> Allan, this seems like it should be straightforward to reproduce now, so
> please go ahead and try to do that, and capture everything, including
> ideally the profile info that Mihael was trying to explain to Zhao how
> to capture.
>
> - Mike
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Mon Jul 13 12:28:54 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 13 Jul 2009 12:28:54 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <1247504642.17460.6.camel@localhost>
References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost>
Message-ID: <4A5B6ED6.60508@mcs.anl.gov>
On 7/13/09 12:04 PM, Mihael Hategan wrote:
> On Mon, 2009-07-13 at 11:45 -0500, Michael Wilde wrote:
>> I thought I wrote an email on this, but cant find it, so I will try to
>> recall what I saw.
>>
>> Sarah tried a test run to re-create the problem of "excessive overhead
>> from coasters on the head node". This was spurred by another complaint
>> from the Ranger sysadmins. The complaint had about the same level of
>> detail as the first: it was voice mail saying "your processing are
>> causing too much overhead on the login node".
>>
>> So we tried to do a test to isolate and quantify what was happening. We
>> did not get far enough, but got some initial observations.
>>
>> Submitting from gwynn.bsd.uchicago.edu (I think) Sarah ran a workflow of
>> 50 sleep 300 jobs (approx).
>>
>> This was around 7PM Thu night Jul 9. Sarah, are these logs still there?
>> Can you copy the coaster and swift logs to the CI where we can look at them?
>>
>> What I saw in top (-b -d) and ps was:
>>
>> - two Java processes were created on login3 (headnode) with her ID
>> - one was about 275MB virt mem and burning 100% CPU time, continuously
>> - one was about 1GB virt mem and not burning much time
>> - tailing the coaster log in Sarah's home directory showed repetitive
>> activity, seemingly about every second, a burst of "polling-like" messages
>> - seems like there were about 3-4 GRAM jobmanagers for the 50 jobs,
>> which would be good, I think (in that it seems like jobs were allocated
>> in blocks).
>>
>> At the time we did not have a chance to gather detailed evidence, but I
>> was surprised by two things:
>>
>> - that there were two Java processes and that one was so big. (Are most
>> likely the active process was just a child thread of the main process?)
>
> One java process is the bootstrap process (it downloads the coaster
> jars, sets up the environment and runs the coaster service). It has
> always been like this. Did you happen to capture the output of ps to a
> file? That would be useful, because from what you are suggesting, it
> appears that the bootstrap process is eating 100% CPU. That process
> should only be sleeping after the service is started.
I *thought* I captured the output of "top -u <sarah's id> -b -d" but I can't
locate it.
As best as I can recall it showed the larger memory-footprint process to
be relatively idle, and the smaller footprint process (about 275MB) to
be burning 100% of a CPU. Allan will try to get a snapshot of this shortly.
If this observation is correct, what's the best way to find out where it's
spinning? Profiling? Debug logging? Can you get profiling data from a
JVM that doesn't exit?
- Mike
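On the "where is it spinning" question, one option that works on a JVM that never exits is to take a couple of jstack thread dumps a few seconds apart (jstack ships with the JDK and attaches to a live process) and intersect the RUNNABLE threads. A sketch of the comparison step, assuming jstack's usual quoted-thread-name header lines and Thread.State lines:

```python
# Sketch: given two jstack thread dumps taken a few seconds apart,
# report thread names that were RUNNABLE in both samples -- likely
# candidates for where the CPU time is going. Thread headers in a
# jstack dump start with the thread name in double quotes, followed
# on the next line by 'java.lang.Thread.State: ...'.
import re

THREAD_RE = re.compile(r'^"([^"]+)"')

def runnable_threads(dump):
    names, current = set(), None
    for line in dump.splitlines():
        m = THREAD_RE.match(line)
        if m:
            current = m.group(1)
        elif "java.lang.Thread.State: RUNNABLE" in line and current:
            names.add(current)
    return names

def likely_spinning(dump_a, dump_b):
    return runnable_threads(dump_a) & runnable_threads(dump_b)
```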
>
>> - that there was continual log activity
>
> By some very odd definition of "continual". The schedule is re-computed
> periodically. The messages also tell you how much time it takes to
> re-compute the schedule, which divided by the pause interval should give
> you the maximum CPU usage for the process for a time period, other
> things ignored. In the idle state, this takes around 1ms (0.1% CPU
> usage).
>
>> while the 50 jobs were sleeping.
>> But I dont have solid evidence that the 50 jobs were actually running
>> and sleeping.
>>
>> I think if we correlate the swift log and the coaster log here we might
>> learn more.
>>
>> I dont know if this was using Mihael's latest code with a reduced
>> logging level or not.
>>
>> Allan, this seems like it should be straightforward to reproduce now, so
>> please go ahead and try to do that, and capture everything, including
>> ideally the profile info that Mihael was trying to explain to Zhao how
>> to capture.
>>
>> - Mike
>>
>>
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From hategan at mcs.anl.gov Mon Jul 13 13:23:15 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 13:23:15 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <4A5B6ED6.60508@mcs.anl.gov>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
Message-ID: <1247509395.20144.4.camel@localhost>
On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
> >>
> >> At the time we did not have a chance to gather detailed evidence, but I
> >> was surprised by two things:
> >>
> >> - that there were two Java processes and that one was so big. (Are most
> >> likely the active process was just a child thread of the main process?)
> >
> > One java process is the bootstrap process (it downloads the coaster
> > jars, sets up the environment and runs the coaster service). It has
> > always been like this. Did you happen to capture the output of ps to a
> > file? That would be useful, because from what you are suggesting, it
> > appears that the bootstrap process is eating 100% CPU. That process
> > should only be sleeping after the service is started.
>
> I *thought* I captured the output of "top -u sarahs'id -b -d" but I cant
> locate it.
>
> As best as I can recall it showed the larger memory-footprint process to
> be relatively idle, and the smaller footprint process (about 275MB) to
> be burning 100% of a CPU.
Normally, the smaller footprint process should be the bootstrap. But
that's why I would like the ps output, because it sounds odd.
> Allan will try to get a snapshot of this shortly.
>
> If this observation if correct, whats the best way to find out where its
> spinning? Profiling? Debug logging? Can you get profiling data from a
> JVM that doesnt exit?
Once I know where it is, I can look at the code and then we'll go from
there.
From aespinosa at cs.uchicago.edu Mon Jul 13 13:55:18 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 13 Jul 2009 13:55:18 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <1247509395.20144.4.camel@localhost>
References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost>
<4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost>
Message-ID: <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From here,
process 22395 is the child of the main java process (bootstrap.jar) and
is the one loading the CPU.
I have coasters.log, worker-*log, swift logs, and gram logs in
~aespinosa/workflows/activelog/run06. Those logs refer to a different
run; there, PID 15206 is the child java process of bootstrap.jar.
top snapshot:
top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55
Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers
Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java
22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top
22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 globus-job-mana
14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd
14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash
22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash
22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash
22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java
22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 globus-job-man
ps snapshot:
22328 ? S 0:00 \_ /bin/bash
22364 ? Sl 0:00 \_
/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
-Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
-DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
-DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
/tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
https://128.135.125.17:46519 11505253269
22395 ? SNl 6:29 \_
/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
-DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
-DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
-Djava.security.egd=file:///dev/urandom -cp
/home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec946b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar:/home/aespinosa/.globus
/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_service-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc
2009/7/13 Mihael Hategan :
> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
>> >>
>> >> At the time we did not have a chance to gather detailed evidence, but I
>> >> was surprised by two things:
>> >>
>> >> - that there were two Java processes and that one was so big. (Are most
>> >> likely the active process was just a child thread of the main process?)
>> >
>> > One java process is the bootstrap process (it downloads the coaster
>> > jars, sets up the environment and runs the coaster service). It has
>> > always been like this. Did you happen to capture the output of ps to a
>> > file? That would be useful, because from what you are suggesting, it
>> > appears that the bootstrap process is eating 100% CPU. That process
>> > should only be sleeping after the service is started.
>>
>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I cant
>> locate it.
>>
>> As best as I can recall it showed the larger memory-footprint process to
>> be relatively idle, and the smaller footprint process (about 275MB) to
>> be burning 100% of a CPU.
>
> Normally, the smaller footprint process should be the bootstrap. But
> that's why I would like the ps output, because it sounds odd.
>
>> ? Allan will try to get a snapshot of this shortly.
>>
>> If this observation if correct, whats the best way to find out where its
>> spinning? Profiling? Debug logging? Can you get profiling data from a
>> JVM that doesnt exit?
>
> Once I know where it is, I can look at the code and then we'll go from
> there.
>
>
>
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From hategan at mcs.anl.gov Mon Jul 13 14:06:09 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 14:06:09 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
Message-ID: <1247511969.21171.4.camel@localhost>
A while ago I committed a patch to run the service process with a lower
priority. Is that in use?
Also, is logging reduced or is it the default?
Is the 97% CPU usage a spike, or does it stay there on average?
Can I take a look at the coaster logs from skenny's run on ranger?
I'd also like to point out, as inoffensively as I can, that I'm working
100% on I2U2, and my lack of deeper involvement in this is a
consequence of that.
On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From
> here process 22395 is the child of the main java process
> (bootstrap.jar) and is loading the CPU.
>
> I have coasters.log, worker-*log, swift logs, gram logs in
> ~aespinosa/workflows/activelog/run06. This refers to a different run.
> PID 15206 is the child java process of bootstrap.jar in here.
>
> top snapshot:
> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55
> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers
> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java
> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top
> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 globus-job-mana
> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd
> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash
> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash
> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash
> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java
> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 globus-job-man
>
> ps snapshot:
>
> 22328 ? S 0:00 \_ /bin/bash
> 22364 ? Sl 0:00 \_
> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
> https://128.135.125.17:46519 11505253269
> 22395 ? SNl 6:29 \_
> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
> -Djava.security.egd=file:///dev/urandom -cp
> [multi-kilobyte -cp classpath of coaster jar cache entries elided;
> identical to the classpath shown in the ps snapshot below]
>
>
>
> 2009/7/13 Mihael Hategan :
> > On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
> >> >>
> >> >> At the time we did not have a chance to gather detailed evidence, but I
> >> >> was surprised by two things:
> >> >>
> >> >> - that there were two Java processes and that one was so big. (Are most
> >> >> likely the active process was just a child thread of the main process?)
> >> >
> >> > One java process is the bootstrap process (it downloads the coaster
> >> > jars, sets up the environment and runs the coaster service). It has
> >> > always been like this. Did you happen to capture the output of ps to a
> >> > file? That would be useful, because from what you are suggesting, it
> >> > appears that the bootstrap process is eating 100% CPU. That process
> >> > should only be sleeping after the service is started.
> >>
> >> I *thought* I captured the output of "top -u sarahs'id -b -d" but I can't
> >> locate it.
> >>
> >> As best as I can recall it showed the larger memory-footprint process to
> >> be relatively idle, and the smaller footprint process (about 275MB) to
> >> be burning 100% of a CPU.
> >
> > Normally, the smaller footprint process should be the bootstrap. But
> > that's why I would like the ps output, because it sounds odd.
> >
> >> Allan will try to get a snapshot of this shortly.
> >>
> >> If this observation is correct, what's the best way to find out where it's
> >> spinning? Profiling? Debug logging? Can you get profiling data from a
> >> JVM that doesn't exit?
> >
> > Once I know where it is, I can look at the code and then we'll go from
> > there.
> >
> >
> >
>
>
>
From wilde at mcs.anl.gov Mon Jul 13 14:11:34 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 13 Jul 2009 14:11:34 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <1247511969.21171.4.camel@localhost>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost>
Message-ID: <4A5B86E6.2000803@mcs.anl.gov>
On 7/13/09 2:06 PM, Mihael Hategan wrote:
> A while ago I committed a patch to run the service process with a lower
> priority. Is that in use?
Looks like 22395 is running with a nice value of 10, which I think is
what you set in that patch: 22395 aespinos 25 10
>
> Also, is logging reduced or is it the default?
>
> Is the 97% CPU usage a spike, or does it stay there on average?
>
> Can I take a look at the coaster logs from skenny's run on ranger?
>
> I'd also like to point out in as little offensive mode as I can, that
> I'm working 100% on I2U2 and my lack of getting more than lightly
> involved in this is a consequence of that.
Right, understood. Any pointers you can give are welcome, and Allan and
I are expecting to do the legwork. We'll at least try to find out where
the overhead is coming from.
- Mike
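To find out where a long-lived JVM is spinning without restarting it, one approach (a sketch; it assumes `jstack` is available in the JDK 1.5 install on the CE node, and uses PID 22395 from the top snapshot below as an example) is to find the hot native thread with `top -H`, convert its TID to hex, and match it against the `nid=0x...` field in a thread dump:

```shell
# Sketch: mapping a hot Linux thread to a Java stack trace.
# PID 22395 is the coaster service JVM from the top output in this thread.
PID=22395

# Step 1 (run on the CE node): per-thread CPU view; note the TID of the
# thread pegging the CPU.
#   top -H -b -n 1 -p "$PID"

# Step 2: jstack prints native thread ids in hex ("nid=0x..."), so
# convert the hot TID from top before searching the dump:
TID=22395                      # example TID; substitute the real one
NID=$(printf 'nid=0x%x' "$TID")
echo "$NID"                    # -> nid=0x577b

# Step 3 (run on the CE node): take a dump and pull the matching stack.
#   jstack "$PID" | grep -A 20 "$NID"
```

Repeating step 3 a few seconds apart shows whether the same stack keeps coming up, which is usually enough to locate a spin loop without a profiler.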
>
> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From
>> here process 22395 is the child of the main java process
>> (bootstrap.jar) and is loading the CPU.
>>
>> I have coasters.log, worker-*log, swift logs, gram logs in
>> ~aespinosa/workflows/activelog/run06. This refers to a different run.
>> PID 15206 is the child java process of bootstrap.jar in here.
>>
>> top snapshot:
>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55
>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers
>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java
>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top
>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 globus-job-mana
>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd
>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash
>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash
>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash
>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java
>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 globus-job-man
>>
>> ps snapshot:
>>
>> 22328 ? S 0:00 \_ /bin/bash
>> 22364 ? Sl 0:00 \_
>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
>> https://128.135.125.17:46519 11505253269
>> 22395 ? SNl 6:29 \_
>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>> -Djava.security.egd=file:///dev/urandom -cp
>> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec94
6b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar:
/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_se
rvice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc
>>
>>
>>
>> 2009/7/13 Mihael Hategan :
>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
>>>>>> At the time we did not have a chance to gather detailed evidence, but I
>>>>>> was surprised by two things:
>>>>>>
>>>>>> - that there were two Java processes and that one was so big. (Are most
>>>>>> likely the active process was just a child thread of the main process?)
>>>>> One java process is the bootstrap process (it downloads the coaster
>>>>> jars, sets up the environment and runs the coaster service). It has
>>>>> always been like this. Did you happen to capture the output of ps to a
>>>>> file? That would be useful, because from what you are suggesting, it
>>>>> appears that the bootstrap process is eating 100% CPU. That process
>>>>> should only be sleeping after the service is started.
>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I can't
>>>> locate it.
>>>>
>>>> As best as I can recall it showed the larger memory-footprint process to
>>>> be relatively idle, and the smaller footprint process (about 275MB) to
>>>> be burning 100% of a CPU.
>>> Normally, the smaller footprint process should be the bootstrap. But
>>> that's why I would like the ps output, because it sounds odd.
>>>
>>>> Allan will try to get a snapshot of this shortly.
>>>>
>>>> If this observation is correct, what's the best way to find out where it's
>>>> spinning? Profiling? Debug logging? Can you get profiling data from a
>>>> JVM that doesn't exit?
>>> Once I know where it is, I can look at the code and then we'll go from
>>> there.
>>>
>>>
>>>
>>
>>
>
From aespinosa at cs.uchicago.edu Mon Jul 13 14:12:44 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 13 Jul 2009 14:12:44 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <1247511969.21171.4.camel@localhost>
References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost>
<4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost>
Message-ID: <50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com>
97% is an average, as can be seen in run06. The Swift version is r3005
and the CoG kit is r2410; this is a vanilla build of Swift.
2009/7/13 Mihael Hategan :
> A while ago I committed a patch to run the service process with a lower
> priority. Is that in use?
>
> Also, is logging reduced or is it the default?
>
> Is the 97% CPU usage a spike, or does it stay there on average?
>
> Can I take a look at the coaster logs from skenny's run on ranger?
>
> I'd also like to point out in as little offensive mode as I can, that
> I'm working 100% on I2U2 and my lack of getting more than lightly
> involved in this is a consequence of that.
>
> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From
>> here process 22395 is the child of the main java process
>> (bootstrap.jar) and is loading the CPU.
>>
>> I have coasters.log, worker-*log, swift logs, gram logs in
>> ~aespinosa/workflows/activelog/run06. This refers to a different run.
>> PID 15206 is the child java process of bootstrap.jar in here.
>>
>> top snapshot:
>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55
>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers
>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java
>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top
>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 globus-job-mana
>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd
>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash
>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash
>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash
>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java
>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 globus-job-man
>>
>> ps snapshot:
>>
>> 22328 ? S 0:00 \_ /bin/bash
>> 22364 ? Sl 0:00 \_
>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
>> https://128.135.125.17:46519 11505253269
>> 22395 ? SNl 6:29 \_
>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>> -Djava.security.egd=file:///dev/urandom -cp
>> [multi-kilobyte coaster jar cache classpath elided; identical to the
>> classpath in the earlier ps snapshot]
>>
>>
>>
>> 2009/7/13 Mihael Hategan :
>> > On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
>> >> >>
>> >> >> At the time we did not have a chance to gather detailed evidence, but I
>> >> >> was surprised by two things:
>> >> >>
>> >> >> - that there were two Java processes and that one was so big. (Are most
>> >> >> likely the active process was just a child thread of the main process?)
>> >> >
>> >> > One java process is the bootstrap process (it downloads the coaster
>> >> > jars, sets up the environment and runs the coaster service). It has
>> >> > always been like this. Did you happen to capture the output of ps to a
>> >> > file? That would be useful, because from what you are suggesting, it
>> >> > appears that the bootstrap process is eating 100% CPU. That process
>> >> > should only be sleeping after the service is started.
>> >>
>> >> I *thought* I captured the output of "top -u sarahs'id -b -d" but I can't
>> >> locate it.
>> >>
>> >> As best as I can recall it showed the larger memory-footprint process to
>> >> be relatively idle, and the smaller footprint process (about 275MB) to
>> >> be burning 100% of a CPU.
>> >
>> > Normally, the smaller footprint process should be the bootstrap. But
>> > that's why I would like the ps output, because it sounds odd.
>> >
>> >> Allan will try to get a snapshot of this shortly.
>> >>
>> >> If this observation is correct, what's the best way to find out where it's
>> >> spinning? Profiling? Debug logging? Can you get profiling data from a
>> >> JVM that doesn't exit?
>> >
>> > Once I know where it is, I can look at the code and then we'll go from
>> > there.
>> >
>> >
>> >
>>
>>
>>
>
>
>
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From wilde at mcs.anl.gov Mon Jul 13 14:18:05 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 13 Jul 2009 14:18:05 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <1247511969.21171.4.camel@localhost>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost>
Message-ID: <4A5B886D.1020400@mcs.anl.gov>
On 7/13/09 2:06 PM, Mihael Hategan wrote:
> A while ago I committed a patch to run the service process with a lower
> priority. Is that in use?
>
> Also, is logging reduced or is it the default?
>
> Is the 97% CPU usage a spike, or does it stay there on average?
In the test I observed Sarah running last Thursday, it stayed close to
100% during the whole run: many minutes of solid, near-100% CPU. During
that time a tail of the coaster log showed a burst of a few messages
every few seconds, not intensive enough to explain the overhead as being
entirely due to logging.
Allan will need to comment on the runs he describes below.
- Mike
>
> Can I take a look at the coaster logs from skenny's run on ranger?
>
> I'd also like to point out in as little offensive mode as I can, that
> I'm working 100% on I2U2 and my lack of getting more than lightly
> involved in this is a consequence of that.
>
> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From
>> here process 22395 is the child of the main java process
>> (bootstrap.jar) and is loading the CPU.
>>
>> I have coasters.log, worker-*log, swift logs, gram logs in
>> ~aespinosa/workflows/activelog/run06. This refers to a different run.
>> PID 15206 is the child java process of bootstrap.jar in here.
>>
>> top snapshot:
>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55
>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers
>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java
>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top
>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 globus-job-mana
>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd
>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash
>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash
>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash
>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java
>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 globus-job-man
>>
>> ps snapshot:
>>
>> 22328 ? S 0:00 \_ /bin/bash
>> 22364 ? Sl 0:00 \_
>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
>> https://128.135.125.17:46519 11505253269
>> 22395 ? SNl 6:29 \_
>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>> -Djava.security.egd=file:///dev/urandom -cp
>> [multi-kilobyte coaster jar cache classpath elided; identical to the
>> classpath in the earlier ps snapshot]
>>
>>
>>
>> 2009/7/13 Mihael Hategan :
>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
>>>>>> At the time we did not have a chance to gather detailed evidence, but I
>>>>>> was surprised by two things:
>>>>>>
>>>>>> - that there were two Java processes and that one was so big. (Are most
>>>>>> likely the active process was just a child thread of the main process?)
>>>>> One java process is the bootstrap process (it downloads the coaster
>>>>> jars, sets up the environment and runs the coaster service). It has
>>>>> always been like this. Did you happen to capture the output of ps to a
>>>>> file? That would be useful, because from what you are suggesting, it
>>>>> appears that the bootstrap process is eating 100% CPU. That process
>>>>> should only be sleeping after the service is started.
>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I can't
>>>> locate it.
>>>>
>>>> As best as I can recall it showed the larger memory-footprint process to
>>>> be relatively idle, and the smaller footprint process (about 275MB) to
>>>> be burning 100% of a CPU.
>>> Normally, the smaller footprint process should be the bootstrap. But
>>> that's why I would like the ps output, because it sounds odd.
>>>
>>>> Allan will try to get a snapshot of this shortly.
>>>>
>>>> If this observation is correct, what's the best way to find out where it's
>>>> spinning? Profiling? Debug logging? Can you get profiling data from a
>>>> JVM that doesn't exit?
>>> Once I know where it is, I can look at the code and then we'll go from
>>> there.
>>>
>>>
>>>
>>
>>
>
From hategan at mcs.anl.gov Mon Jul 13 14:24:25 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 14:24:25 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <4A5B86E6.2000803@mcs.anl.gov>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost> <4A5B86E6.2000803@mcs.anl.gov>
Message-ID: <1247513065.21484.8.camel@localhost>
On Mon, 2009-07-13 at 14:11 -0500, Michael Wilde wrote:
> On 7/13/09 2:06 PM, Mihael Hategan wrote:
> > A while ago I committed a patch to run the service process with a lower
> > priority. Is that in use?
>
> Looks like 22395 is running with a nice value of 10 which I think is
> what you set in that patch: 22395 aespinos 25 10
Ok. Now, lower priority doesn't mean it won't use CPU. It means that
other processes with a higher priority will get preferential treatment,
and if there is CPU left and the coasters need it, it will be used.
In other words, near 100% CPU usage isn't in itself a problem. While it
shouldn't stay there according to my understanding of the code, if that
is the only problem observed, then I think it's an overreaction.
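The point about niceness can be illustrated directly (a sketch; the `ps` invocation at the end is an example and assumes PID 22395 from the snapshots in this thread):

```shell
# Sketch: a nice value of 10 (the NI column in the top output above)
# lowers scheduling priority but does not cap CPU; on an otherwise idle
# box the niced service will still get close to a full core.

# `nice` with no arguments prints the current niceness; a child started
# via `nice -n 10` inherits the raised value:
echo "current niceness: $(nice)"
echo "child niceness:   $(nice -n 10 sh -c 'nice')"

# For a running process, the same value shows up in ps, e.g. for the
# coaster service (PID 22395 in the snapshots above):
#   ps -o pid,ni,pcpu,comm -p 22395
```

So the 97% figure alone only says the machine had spare cycles; it does not by itself show the service starving anything of higher priority.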
> >
> > Also, is logging reduced or is it the default?
> >
> > Is the 97% CPU usage a spike, or does it stay there on average?
> >
> > Can I take a look at the coaster logs from skenny's run on ranger?
> >
> > I'd also like to point out in as little offensive mode as I can, that
> > I'm working 100% on I2U2 and my lack of getting more than lightly
> > involved in this is a consequence of that.
>
> Right, understood. Any pointers you can give are welcome, and Allan and
> I are expecting to do the legwork. We'll at least try to find out where
> the overhead is coming from.
I find it somewhat odd that there was a process with 1GB of virtual
memory use. Are you sure that wasn't a WSRF container from somebody
else?
Can we switch to exclusive evidence mode here (i.e. nothing is
considered unless there is clear proof of it, like a screen dump, log
output, or a copy and paste of a session from a terminal)?
From hategan at mcs.anl.gov Mon Jul 13 14:25:44 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 14:25:44 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost>
<50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com>
Message-ID: <1247513144.21484.10.camel@localhost>
On Mon, 2009-07-13 at 14:12 -0500, Allan Espinosa wrote:
> 97% is an average as can be seen in run06. swift version is r3005 and
> cogkit r2410. this is a vanilla build of swift.
Can you run with reduced logging? We established before that logging
appears to be a problem, and until we eliminate that it's wasteful to
continue guessing.
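Since the service's classpath shows it logs through log4j 1.2, one way to reduce logging is a WARN-level properties file (a sketch using standard log4j 1.x property syntax; the exact config file the coaster service reads, and the `org.globus.cog` logger name, are assumptions to verify against the cog/swift install on the CE):

```shell
# Sketch: a log4j configuration that keeps only warnings and errors.
cat > log4j.properties <<'EOF'
# Only warnings and errors reach the console appender
log4j.rootLogger=WARN, CONSOLE
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d %-5p %c - %m%n
# Explicitly quiet the cog/coaster packages (assumed logger name)
log4j.logger.org.globus.cog=WARN
EOF
grep -c '=WARN' log4j.properties   # -> 2
```

Re-running the 2000-sleep-job test with this in place would show whether the CPU burn tracks logging volume or persists independently of it.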
>
> 2009/7/13 Mihael Hategan :
> > A while ago I committed a patch to run the service process with a lower
> > priority. Is that in use?
> >
> > Also, is logging reduced or is it the default?
> >
> > Is the 97% CPU usage a spike, or does it stay there on average?
> >
> > Can I take a look at the coaster logs from skenny's run on ranger?
> >
> > I'd also like to point out in as little offensive mode as I can, that
> > I'm working 100% on I2U2 and my lack of getting more than lightly
> > involved in this is a consequence of that.
> >
> > On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
> >> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From
> >> here process 22395 is the child of the main java process
> >> (bootstrap.jar) and is loading the CPU.
> >>
> >> I have coasters.log, worker-*log, swift logs, gram logs in
> >> ~aespinosa/workflows/activelog/run06. This refers to a different run.
> >> PID 15206 is the child java process of bootstrap.jar in here.
> >>
> >> top snapshot:
> >> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55
> >> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
> >> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> >> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers
> >> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached
> >>
> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> >> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java
> >> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top
> >> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 globus-job-mana
> >> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd
> >> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash
> >> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash
> >> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash
> >> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java
> >> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 globus-job-man
> >>
> >> ps snapshot:
> >>
> >> 22328 ? S 0:00 \_ /bin/bash
> >> 22364 ? Sl 0:00 \_
> >> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
> >> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
> >> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
> >> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
> >> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
> >> https://128.135.125.17:46519 11505253269
> >> 22395 ? SNl 6:29 \_
> >> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
> >> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
> >> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
> >> -Djava.security.egd=file:///dev/urandom -cp
> >> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec946b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar:/home/aespinosa/.g
lobus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_service-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc
> >>
> >>
> >>
> >> 2009/7/13 Mihael Hategan :
> >> > On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
> >> >> >>
> >> >> >> At the time we did not have a chance to gather detailed evidence, but I
> >> >> >> was surprised by two things:
> >> >> >>
> >> >> >> - that there were two Java processes and that one was so big. (Or most
> >> >> >> likely the active process was just a child thread of the main process?)
> >> >> >
> >> >> > One java process is the bootstrap process (it downloads the coaster
> >> >> > jars, sets up the environment and runs the coaster service). It has
> >> >> > always been like this. Did you happen to capture the output of ps to a
> >> >> > file? That would be useful, because from what you are suggesting, it
> >> >> > appears that the bootstrap process is eating 100% CPU. That process
> >> >> > should only be sleeping after the service is started.
> >> >>
> >> >> I *thought* I captured the output of "top -u <sarah's id> -b -d" but I
> >> >> can't locate it.
> >> >>
> >> >> As best as I can recall it showed the larger memory-footprint process to
> >> >> be relatively idle, and the smaller footprint process (about 275MB) to
> >> >> be burning 100% of a CPU.
> >> >
> >> > Normally, the smaller footprint process should be the bootstrap. But
> >> > that's why I would like the ps output, because it sounds odd.
> >> >
> >> >> Allan will try to get a snapshot of this shortly.
> >> >>
> >> >> If this observation is correct, what's the best way to find out where it's
> >> >> spinning? Profiling? Debug logging? Can you get profiling data from a
> >> >> JVM that doesn't exit?
> >> >
> >> > Once I know where it is, I can look at the code and then we'll go from
> >> > there.
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >
> >
> >
>
>
>
From hategan at mcs.anl.gov Mon Jul 13 14:30:40 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 14:30:40 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <4A5B886D.1020400@mcs.anl.gov>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost> <4A5B886D.1020400@mcs.anl.gov>
Message-ID: <1247513440.21484.16.camel@localhost>
On Mon, 2009-07-13 at 14:18 -0500, Michael Wilde wrote:
> On 7/13/09 2:06 PM, Mihael Hategan wrote:
> > A while ago I committed a patch to run the service process with a lower
> > priority. Is that in use?
> >
> > Also, is logging reduced or is it the default?
> >
> > Is the 97% CPU usage a spike, or does it stay there on average?
>
> In the test I observed Sarah running last Thu, it stayed close to 100%
> during the whole run - many minutes, solid near-100% CPU. During that
> time a tail of the coaster log showed a burst of a few messages every
> few seconds - not intensive enough to explain the overhead as all due to
> logging.
Ok, that does look like a problem. I need to see the log from that.
In addition, when you observe this SPECIFIC behavior (solid near 100%
CPU, burst of a few messages every few seconds and not much else in the
logs), please do a jstack on the process in question and send the
output of that.
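For reference, running `jstack <pid>` from another terminal dumps every thread's stack. If jstack is not on the path, roughly the same information can be obtained in-process on Java 5+ (the deployments here use jdk1.5). The class below is a hypothetical helper for illustration, not part of the coaster code:

```java
import java.util.Map;

// Hypothetical helper (not part of the coaster code): prints every
// thread's stack trace from inside the JVM, roughly what an external
// `jstack <pid>` reports. Thread.getAllStackTraces() exists since Java 5.
public class StackDump {
    public static void dump() {
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            System.out.println("\"" + e.getKey().getName() + "\""
                    + " state=" + e.getKey().getState());
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }

    public static void main(String[] args) {
        dump();
    }
}
```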
From hategan at mcs.anl.gov Mon Jul 13 14:34:04 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 14:34:04 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <1247513440.21484.16.camel@localhost>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost> <4A5B886D.1020400@mcs.anl.gov>
<1247513440.21484.16.camel@localhost>
Message-ID: <1247513644.21484.18.camel@localhost>
On Mon, 2009-07-13 at 14:30 -0500, Mihael Hategan wrote:
> On Mon, 2009-07-13 at 14:18 -0500, Michael Wilde wrote:
> > On 7/13/09 2:06 PM, Mihael Hategan wrote:
> > > A while ago I committed a patch to run the service process with a lower
> > > priority. Is that in use?
> > >
> > > Also, is logging reduced or is it the default?
> > >
> > > Is the 97% CPU usage a spike, or does it stay there on average?
> >
> > In the test I observed Sarah running last Thu, it stayed close to 100%
> > during the whole run - many minutes, solid near-100% CPU. During that
> > time a tail of the coaster log showed a burst of a few messages every
> > few seconds - not intensive enough to explain the overhead as all due to
> > logging.
>
> Ok, that does look like a problem. I need to see the log from that.
However, I want to stress that it may NOT be the same problem in all
cases of high CPU usage. So reduced logging should still be used before
trying to reproduce this specific problem.
>
> In addition, when you observe this SPECIFIC behavior (solid near 100%
> CPU, burst of a few messages every few seconds and not much else in the
> logs), please do a jstack on the process in question and send the
> output of that.
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From aespinosa at cs.uchicago.edu Mon Jul 13 17:04:35 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 13 Jul 2009 17:04:35 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <4A5BA007.2050101@mcs.anl.gov>
References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost>
<4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost>
<50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com>
<4A5BA007.2050101@mcs.anl.gov>
Message-ID: <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com>
hi,
here is a patch which fixes the CPU usage of the coaster service started
by bootstrap: http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch
suggested svn log entry:
Added locks via wait() and notify() to prevent busy waiting/
active polling in the block task queue.
Tested 2000 touch jobs using 066-many.swift via local:local:
before: http://www.ci.uchicago.edu/~aespinosa/swift/run06
after: http://www.ci.uchicago.edu/~aespinosa/swift/run07
CPU usage drops from 100% to 0%, with a few 25-40% spikes!
-Allan
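The shape of the fix, replacing the polling loop with wait()/notify(), can be sketched as follows. The class and method names here are illustrative only, not the actual provider-coaster code:

```java
import java.util.LinkedList;

// Illustrative sketch of the busy-wait fix: the consumer blocks in
// wait() until a producer calls notify(), instead of actively polling
// the queue in a loop.
public class TaskQueue {
    private final LinkedList<Runnable> queue = new LinkedList<Runnable>();

    public void put(Runnable task) {
        synchronized (queue) {
            queue.add(task);
            queue.notify(); // wake one waiting consumer
        }
    }

    public Runnable take() throws InterruptedException {
        synchronized (queue) {
            // A while loop, not an if: re-check the condition after
            // waking, to guard against spurious wakeups.
            while (queue.isEmpty()) {
                queue.wait();
            }
            return queue.removeFirst();
        }
    }
}
```

Both calls happen inside `synchronized (queue)` blocks, as required for `wait()`/`notify()` on that monitor.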
2009/7/13 Michael Wilde :
> Hi Allan,
>
> I think the methods you want for synchronization are part of class Object.
>
> They are documented in the chapter Threads and Locks of The Java Language
> Specification:
>
> http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8
>
> queue.wait() should be called if the queue is empty.
>
> queue.notify() or .notifyAll() should be called when something is added to
> the queue. I think notify() should work.
>
> .wait() will, I think, take a timeout, but I suspect you don't need that.
>
> Both should be called within the synchronized(queue) constructs that are
> already in the code.
>
> Should be fun to fix this!
>
> - Mike
>
>
>
>
>
> On 7/13/09 2:12 PM, Allan Espinosa wrote:
>>
>> 97% is an average as can be seen in run06. swift version is r3005 and
>> cogkit r2410. this is a vanilla build of swift.
>>
>> 2009/7/13 Mihael Hategan :
>>>
>>> A while ago I committed a patch to run the service process with a lower
>>> priority. Is that in use?
>>>
>>> Also, is logging reduced or is it the default?
>>>
>>> Is the 97% CPU usage a spike, or does it stay there on average?
>>>
>>> Can I take a look at the coaster logs from skenny's run on ranger?
>>>
>>> I'd also like to point out in as little offensive mode as I can, that
>>> I'm working 100% on I2U2 and my lack of getting more than lightly
>>> involved in this is a consequence of that.
>>>
>>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
>>>>
>>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From
>>>> here process 22395 is the child of the main java process
>>>> (bootstrap.jar) and is loading the CPU.
>>>>
>>>> I have coasters.log, worker-*log, swift logs, gram logs in
>>>> ~aespinosa/workflows/activelog/run06. This refers to a different run.
>>>> PID 15206 is the child java process of bootstrap.jar in here.
>>>>
>>>> top snapshot:
>>>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55
>>>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
>>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>>>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers
>>>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached
>>>>
>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>> 22395 aespinos  25  10  525m  91m  13m S 97.5  2.3   4:29.22 java
>>>> 22217 aespinos  15   0 10736 1048  776 R  0.3  0.0   0:00.50 top
>>>> 22243 aespinos  16   0  102m 5576 3536 S  0.3  0.1   0:00.10 globus-job-mana
>>>> 14764 aespinos  15   0 98024 1744  976 S  0.0  0.0   0:00.06 sshd
>>>> 14765 aespinos  15   0 65364 2796 1176 S  0.0  0.1   0:00.18 bash
>>>> 22326 aespinos  18   0  8916 1052  852 S  0.0  0.0   0:00.00 bash
>>>> 22328 aespinos  19   0  8916 1116  908 S  0.0  0.0   0:00.00 bash
>>>> 22364 aespinos  15   0 1222m  18m 8976 S  0.0  0.5   0:00.20 java
>>>> 22444 aespinos  16   0  102m 5684 3528 S  0.0  0.1   0:00.09 globus-job-man
>>>>
>>>> ps snapshot:
>>>>
>>>> 22328 ?        S      0:00  \_ /bin/bash
>>>> 22364 ?        Sl     0:00      \_
>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
>>>>
>>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
>>>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
>>>> https://128.135.125.17:46519 11505253269
>>>> 22395 ?        SNl    6:29          \_
>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
>>>>
>>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>>>> -Djava.security.egd=file:///dev/urandom -cp
>>>>
>>>> [long coaster jar classpath elided; identical to the one in the earlier quoted ps output]
>>>>
>>>>
>>>>
>>>> 2009/7/13 Mihael Hategan :
>>>>>
>>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
>>>>>>>>
>>>>>>>> At the time we did not have a chance to gather detailed evidence,
>>>>>>>> but I
>>>>>>>> was surprised by two things:
>>>>>>>>
>>>>>>>> - that there were two Java processes and that one was so big. (Or
>>>>>>>> most
>>>>>>>> likely the active process was just a child thread of the main
>>>>>>>> process?)
>>>>>>>
>>>>>>> One java process is the bootstrap process (it downloads the coaster
>>>>>>> jars, sets up the environment and runs the coaster service). It has
>>>>>>> always been like this. Did you happen to capture the output of ps to
>>>>>>> a
>>>>>>> file? That would be useful, because from what you are suggesting, it
>>>>>>> appears that the bootstrap process is eating 100% CPU. That process
>>>>>>> should only be sleeping after the service is started.
>>>>>>
>>>>>> I *thought* I captured the output of "top -u <sarah's id> -b -d" but I
>>>>>> can't
>>>>>> locate it.
>>>>>>
>>>>>> As best as I can recall it showed the larger memory-footprint process
>>>>>> to
>>>>>> be relatively idle, and the smaller footprint process (about 275MB) to
>>>>>> be burning 100% of a CPU.
>>>>>
>>>>> Normally, the smaller footprint process should be the bootstrap. But
>>>>> that's why I would like the ps output, because it sounds odd.
>>>>>
>>>>>> Allan will try to get a snapshot of this shortly.
>>>>>>
>>>>>> If this observation is correct, what's the best way to find out where
>>>>>> it's
>>>>>> spinning? Profiling? Debug logging? Can you get profiling data from a
>>>>>> JVM that doesn't exit?
>>>>>
>>>>> Once I know where it is, I can look at the code and then we'll go from
>>>>> there.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From wilde at mcs.anl.gov Mon Jul 13 17:17:26 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 13 Jul 2009 17:17:26 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost>
<50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com>
<4A5BA007.2050101@mcs.anl.gov>
<50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com>
Message-ID: <4A5BB276.1070902@mcs.anl.gov>
Nice! Now let's beat it up and see how well it works.
Sarah: Allan did not encounter the error messages you mentioned to me.
I suggest you do this:
- post to the devel list the messages you got
- test this patch to see if it clears up the problem
Mike
On 7/13/09 5:04 PM, Allan Espinosa wrote:
> hi,
>
> here is a patch which fixes the CPU usage of the coaster service started
> by bootstrap: http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch
>
> suggested svn log entry:
> Added locks via wait() and notify() to prevent busy waiting/
> active polling in the block task queue.
>
>
> Tested 2000 touch jobs using 066-many.swift via local:local:
> before: http://www.ci.uchicago.edu/~aespinosa/swift/run06
> after: http://www.ci.uchicago.edu/~aespinosa/swift/run07
>
> CPU usage drops from 100% to 0%, with a few 25-40% spikes!
>
> -Allan
>
>
> 2009/7/13 Michael Wilde :
>> Hi Allan,
>>
>> I think the methods you want for synchronization are part of class Object.
>>
>> They are documented in the chapter Threads and Locks of The Java Language
>> Specification:
>>
>> http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8
>>
>> queue.wait() should be called if the queue is empty.
>>
>> queue.notify() or .notifyAll() should be called when something is added to
>> the queue. I think notify() should work.
>>
>> .wait() will, I think, take a timeout, but I suspect you don't need that.
>>
>> Both should be called within the synchronized(queue) constructs that are
>> already in the code.
>>
>> Should be fun to fix this!
>>
>> - Mike
>>
>>
>>
>>
>>
>> On 7/13/09 2:12 PM, Allan Espinosa wrote:
>>> 97% is an average as can be seen in run06. swift version is r3005 and
>>> cogkit r2410. this is a vanilla build of swift.
>>>
>>> 2009/7/13 Mihael Hategan :
>>>> A while ago I committed a patch to run the service process with a lower
>>>> priority. Is that in use?
>>>>
>>>> Also, is logging reduced or is it the default?
>>>>
>>>> Is the 97% CPU usage a spike, or does it stay there on average?
>>>>
>>>> Can I take a look at the coaster logs from skenny's run on ranger?
>>>>
>>>> I'd also like to point out in as little offensive mode as I can, that
>>>> I'm working 100% on I2U2 and my lack of getting more than lightly
>>>> involved in this is a consequence of that.
>>>>
>>>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
>>>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From
>>>>> here process 22395 is the child of the main java process
>>>>> (bootstrap.jar) and is loading the CPU.
>>>>>
>>>>> I have coasters.log, worker-*log, swift logs, gram logs in
>>>>> ~aespinosa/workflows/activelog/run06. This refers to a different run.
>>>>> PID 15206 is the child java process of bootstrap.jar in here.
>>>>>
>>>>> top snapshot:
>>>>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80,
>>>>> 0.55
>>>>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
>>>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si,
>>>>> 0.0%st
>>>>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers
>>>>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached
>>>>>
>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java
>>>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top
>>>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10
>>>>> globus-job-mana
>>>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd
>>>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash
>>>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash
>>>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash
>>>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java
>>>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09
>>>>> globus-job-man
>>>>>
>>>>> ps snapshot:
>>>>>
>>>>> 22328 ? S 0:00 \_ /bin/bash
>>>>> 22364 ? Sl 0:00 \_
>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>>>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
>>>>>
>>>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
>>>>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
>>>>> https://128.135.125.17:46519 11505253269
>>>>> 22395 ? SNl 6:29 \_
>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
>>>>>
>>>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>>>>> -Djava.security.egd=file:///dev/urandom -cp
>>>>>
>>>>> [long coaster jar classpath elided; identical to the one in the earlier quoted ps output]
>>>>>
>>>>>
>>>>> 2009/7/13 Mihael Hategan :
>>>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
>>>>>>>>> At the time we did not have a chance to gather detailed evidence,
>>>>>>>>> but I
>>>>>>>>> was surprised by two things:
>>>>>>>>>
>>>>>>>>> - that there were two Java processes and that one was so big. (Most
>>>>>>>>> likely the active process was just a child thread of the main
>>>>>>>>> process?)
>>>>>>>> One java process is the bootstrap process (it downloads the coaster
>>>>>>>> jars, sets up the environment and runs the coaster service). It has
>>>>>>>> always been like this. Did you happen to capture the output of ps to
>>>>>>>> a
>>>>>>>> file? That would be useful, because from what you are suggesting, it
>>>>>>>> appears that the bootstrap process is eating 100% CPU. That process
>>>>>>>> should only be sleeping after the service is started.
>>>>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I
>>>>>>> can't
>>>>>>> locate it.
>>>>>>>
>>>>>>> As best as I can recall it showed the larger memory-footprint process
>>>>>>> to
>>>>>>> be relatively idle, and the smaller footprint process (about 275MB) to
>>>>>>> be burning 100% of a CPU.
>>>>>> Normally, the smaller footprint process should be the bootstrap. But
>>>>>> that's why I would like the ps output, because it sounds odd.
>>>>>>
>>>>>>> Allan will try to get a snapshot of this shortly.
>>>>>>>
>>>>>> If this observation is correct, what's the best way to find out where
>>>>>> it's
>>>>>> spinning? Profiling? Debug logging? Can you get profiling data from a
>>>>>> JVM that doesn't exit?
>>>>>> Once I know where it is, I can look at the code and then we'll go from
>>>>>> there.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
>
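Mihael's request above ("Did you happen to capture the output of ps to a file?") comes up repeatedly in this thread. A sketch of capturing that evidence in one pass — a process tree plus a single batch-mode top sample written to a file. The flags are GNU procps options and the output filename is arbitrary, not anything from the thread:

```shell
# Capture a ps tree for the current user, falling back to a plain
# listing if --forest is unsupported on this ps.
u=$(id -un)
ps -u "$u" -o pid,ppid,ni,pcpu,pmem,etime,args --forest > ps-snapshot.txt \
  || ps aux > ps-snapshot.txt
# Append one batch-mode top pass (-b = batch, -n 1 = one iteration);
# ignore failure on tops without these flags.
top -b -n 1 >> ps-snapshot.txt 2>/dev/null || true
wc -l < ps-snapshot.txt
```

Run periodically (e.g. under `watch` or a cron entry), this gives exactly the kind of per-process NI/%CPU history the thread later uses to pin the load on PID 22395.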
From hategan at mcs.anl.gov Mon Jul 13 17:34:07 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 17:34:07 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost>
<50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com>
<4A5BA007.2050101@mcs.anl.gov>
<50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com>
Message-ID: <1247524447.25358.0.camel@localhost>
Holy matrimony! I will go sit in a corner now.
Very nice work, Allan.
Mihael
On Mon, 2009-07-13 at 17:04 -0500, Allan Espinosa wrote:
> hi,
>
> here is a patch which solves the cpu usage on the bootstrap coaster
> service: http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch
>
> suggested svn log entry:
> Added locks via wait() and notify() to prevent busy waiting/
> active polling in the block task queue.
>
>
> Test: 2000 touch jobs using 066-many.swift via local:local:
> before: http://www.ci.uchicago.edu/~aespinosa/swift/run06
> after: http://www.ci.uchicago.edu/~aespinosa/swift/run07
>
> CPU usage drops from 100% to 0% with a few 25-40 % spikes!
>
> -Allan
>
>
> 2009/7/13 Michael Wilde :
> > Hi Allan,
> >
> > I think the methods you want for synchronization are part of class Object.
> >
> > They are documented in the chapter Threads and Locks of The Java Language
> > Specification:
> >
> > http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8
> >
> > queue.wait() should be called if the queue is empty.
> >
> > queue.notify() or .notifyAll() should be called when something is added to
> > the queue. I think notify() should work.
> >
> > .wait() will, I think, take a timeout, but I suspect you don't need that.
> >
> > Both should be called within the synchronized(queue) constructs that are
> > already in the code.
> >
> > Should be fun to fix this!
> >
> > - Mike
> >
> >
> >
> >
> >
> > On 7/13/09 2:12 PM, Allan Espinosa wrote:
> >>
> >> 97% is an average as can be seen in run06. swift version is r3005 and
> >> cogkit r2410. this is a vanilla build of swift.
> >>
> >> 2009/7/13 Mihael Hategan :
> >>>
> >>> A while ago I committed a patch to run the service process with a lower
> >>> priority. Is that in use?
> >>>
> >>> Also, is logging reduced or is it the default?
> >>>
> >>> Is the 97% CPU usage a spike, or does it stay there on average?
> >>>
> >>> Can I take a look at the coaster logs from skenny's run on ranger?
> >>>
> >>> I'd also like to point out in as little offensive mode as I can, that
> >>> I'm working 100% on I2U2 and my lack of getting more than lightly
> >>> involved in this is a consequence of that.
> >>>
> >>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
> >>>>
> >>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From
> >>>> here process 22395 is the child of the main java process
> >>>> (bootstrap.jar) and is loading the CPU.
> >>>>
> >>>> I have coasters.log, worker-*log, swift logs, gram logs in
> >>>> ~aespinosa/workflows/activelog/run06. This refers to a different run.
> >>>> PID 15206 is the child java process of bootstrap.jar in here.
> >>>>
> >>>> top snapshot:
> >>>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80,
> >>>> 0.55
> >>>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
> >>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si,
> >>>> 0.0%st
> >>>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers
> >>>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached
> >>>>
> >>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> >>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java
> >>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top
> >>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10
> >>>> globus-job-mana
> >>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd
> >>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash
> >>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash
> >>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash
> >>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java
> >>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09
> >>>> globus-job-man
> >>>>
> >>>> ps snapshot:
> >>>>
> >>>> 22328 ? S 0:00 \_ /bin/bash
> >>>> 22364 ? Sl 0:00 \_
> >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
> >>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
> >>>>
> >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
> >>>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
> >>>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
> >>>> https://128.135.125.17:46519 11505253269
> >>>> 22395 ? SNl 6:29 \_
> >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
> >>>>
> >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
> >>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
> >>>> -Djava.security.egd=file:///dev/urandom -cp
> >>>>
> >>>> [long coaster-jar classpath snipped]
> >>>>
> >>>>
> >>>>
> >>>> 2009/7/13 Mihael Hategan :
> >>>>>
> >>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
> >>>>>>>>
> >>>>>>>> At the time we did not have a chance to gather detailed evidence,
> >>>>>>>> but I
> >>>>>>>> was surprised by two things:
> >>>>>>>>
> >>>>>>>> - that there were two Java processes and that one was so big. (Most
> >>>>>>>> likely the active process was just a child thread of the main
> >>>>>>>> process?)
> >>>>>>>
> >>>>>>> One java process is the bootstrap process (it downloads the coaster
> >>>>>>> jars, sets up the environment and runs the coaster service). It has
> >>>>>>> always been like this. Did you happen to capture the output of ps to
> >>>>>>> a
> >>>>>>> file? That would be useful, because from what you are suggesting, it
> >>>>>>> appears that the bootstrap process is eating 100% CPU. That process
> >>>>>>> should only be sleeping after the service is started.
> >>>>>>
> >>>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I
> >>>>>> can't
> >>>>>> locate it.
> >>>>>>
> >>>>>> As best as I can recall it showed the larger memory-footprint process
> >>>>>> to
> >>>>>> be relatively idle, and the smaller footprint process (about 275MB) to
> >>>>>> be burning 100% of a CPU.
> >>>>>
> >>>>> Normally, the smaller footprint process should be the bootstrap. But
> >>>>> that's why I would like the ps output, because it sounds odd.
> >>>>>
> >>>>>> Allan will try to get a snapshot of this shortly.
> >>>>>>
> >>>>>> If this observation is correct, what's the best way to find out where
> >>>>>> it's
> >>>>>> spinning? Profiling? Debug logging? Can you get profiling data from a
> >>>>>> JVM that doesn't exit?
> >>>>>
> >>>>> Once I know where it is, I can look at the code and then we'll go from
> >>>>> there.
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
>
>
>
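The busy-wait fix Allan committed boils down to the standard Object.wait()/notify() pattern Mike describes in the quoted advice. A minimal, self-contained sketch of that pattern — class and method names here are invented for illustration and are not from the actual provider-coaster source:

```java
import java.util.LinkedList;
import java.util.Queue;

// Blocking task queue using wait()/notify() instead of spin-polling.
public class BlockTaskQueueSketch {
    private final Queue<Runnable> queue = new LinkedList<Runnable>();

    // Producer side: add a task and wake one blocked consumer.
    public void enqueue(Runnable task) {
        synchronized (queue) {
            queue.add(task);
            queue.notify(); // use notifyAll() if several consumers may wait
        }
    }

    // Consumer side: block until a task arrives; no CPU burned while idle.
    public Runnable take() throws InterruptedException {
        synchronized (queue) {
            while (queue.isEmpty()) {
                queue.wait(); // releases the queue lock while waiting
            }
            return queue.remove();
        }
    }

    // Tiny demo: one consumer thread blocks, one task is enqueued and runs.
    public static String demo() throws InterruptedException {
        final BlockTaskQueueSketch q = new BlockTaskQueueSketch();
        final StringBuilder out = new StringBuilder();
        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    q.take().run();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        consumer.start();
        q.enqueue(new Runnable() {
            public void run() { out.append("ran"); }
        });
        consumer.join();
        return out.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());
    }
}
```

The two points Mike's note insists on are both visible here: wait() sits inside a while loop guarding the queue-empty condition (spurious wakeups), and both wait() and notify() are invoked only while holding the synchronized(queue) lock.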
From hategan at mcs.anl.gov Mon Jul 13 17:41:42 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 17:41:42 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost>
<50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com>
<4A5BA007.2050101@mcs.anl.gov>
<50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com>
Message-ID: <1247524902.25358.3.camel@localhost>
A slightly modified version of this is in cog r2429.
Thanks again,
Mihael
On Mon, 2009-07-13 at 17:04 -0500, Allan Espinosa wrote:
> hi,
>
> here is a patch which solves the cpu usage on the bootstrap coaster
> service: http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch
>
> suggested svn log entry:
> Added locks via wait() and notify() to prevent busy waiting/
> active polling in the block task queue.
>
>
> Test: 2000 touch jobs using 066-many.swift via local:local:
> before: http://www.ci.uchicago.edu/~aespinosa/swift/run06
> after: http://www.ci.uchicago.edu/~aespinosa/swift/run07
>
> CPU usage drops from 100% to 0% with a few 25-40 % spikes!
>
> -Allan
>
>
> 2009/7/13 Michael Wilde :
> > Hi Allan,
> >
> > I think the methods you want for synchronization are part of class Object.
> >
> > They are documented in the chapter Threads and Locks of The Java Language
> > Specification:
> >
> > http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8
> >
> > queue.wait() should be called if the queue is empty.
> >
> > queue.notify() or .notifyAll() should be called when something is added to
> > the queue. I think notify() should work.
> >
> > .wait() will, I think, take a timeout, but I suspect you don't need that.
> >
> > Both should be called within the synchronized(queue) constructs that are
> > already in the code.
> >
> > Should be fun to fix this!
> >
> > - Mike
> >
> >
> >
> >
> >
> > On 7/13/09 2:12 PM, Allan Espinosa wrote:
> >>
> >> 97% is an average as can be seen in run06. swift version is r3005 and
> >> cogkit r2410. this is a vanilla build of swift.
> >>
> >> 2009/7/13 Mihael Hategan :
> >>>
> >>> A while ago I committed a patch to run the service process with a lower
> >>> priority. Is that in use?
> >>>
> >>> Also, is logging reduced or is it the default?
> >>>
> >>> Is the 97% CPU usage a spike, or does it stay there on average?
> >>>
> >>> Can I take a look at the coaster logs from skenny's run on ranger?
> >>>
> >>> I'd also like to point out in as little offensive mode as I can, that
> >>> I'm working 100% on I2U2 and my lack of getting more than lightly
> >>> involved in this is a consequence of that.
> >>>
> >>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
> >>>>
> >>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From
> >>>> here process 22395 is the child of the main java process
> >>>> (bootstrap.jar) and is loading the CPU.
> >>>>
> >>>> I have coasters.log, worker-*log, swift logs, gram logs in
> >>>> ~aespinosa/workflows/activelog/run06. This refers to a different run.
> >>>> PID 15206 is the child java process of bootstrap.jar in here.
> >>>>
> >>>> top snapshot:
> >>>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80,
> >>>> 0.55
> >>>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
> >>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si,
> >>>> 0.0%st
> >>>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers
> >>>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached
> >>>>
> >>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> >>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java
> >>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top
> >>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10
> >>>> globus-job-mana
> >>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd
> >>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash
> >>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash
> >>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash
> >>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java
> >>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09
> >>>> globus-job-man
> >>>>
> >>>> ps snapshot:
> >>>>
> >>>> 22328 ? S 0:00 \_ /bin/bash
> >>>> 22364 ? Sl 0:00 \_
> >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
> >>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
> >>>>
> >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
> >>>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
> >>>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
> >>>> https://128.135.125.17:46519 11505253269
> >>>> 22395 ? SNl 6:29 \_
> >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
> >>>>
> >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
> >>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
> >>>> -Djava.security.egd=file:///dev/urandom -cp
> >>>>
> >>>> [long coaster-jar classpath snipped]
> >>>>
> >>>>
> >>>>
> >>>> 2009/7/13 Mihael Hategan :
> >>>>>
> >>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
> >>>>>>>>
> >>>>>>>> At the time we did not have a chance to gather detailed evidence,
> >>>>>>>> but I
> >>>>>>>> was surprised by two things:
> >>>>>>>>
> >>>>>>>> - that there were two Java processes and that one was so big. (Most
> >>>>>>>> likely the active process was just a child thread of the main
> >>>>>>>> process?)
> >>>>>>>
> >>>>>>> One java process is the bootstrap process (it downloads the coaster
> >>>>>>> jars, sets up the environment and runs the coaster service). It has
> >>>>>>> always been like this. Did you happen to capture the output of ps to
> >>>>>>> a
> >>>>>>> file? That would be useful, because from what you are suggesting, it
> >>>>>>> appears that the bootstrap process is eating 100% CPU. That process
> >>>>>>> should only be sleeping after the service is started.
> >>>>>>
> >>>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I
> >>>>>> can't
> >>>>>> locate it.
> >>>>>>
> >>>>>> As best as I can recall it showed the larger memory-footprint process
> >>>>>> to
> >>>>>> be relatively idle, and the smaller footprint process (about 275MB) to
> >>>>>> be burning 100% of a CPU.
> >>>>>
> >>>>> Normally, the smaller footprint process should be the bootstrap. But
> >>>>> that's why I would like the ps output, because it sounds odd.
> >>>>>
> >>>>>> Allan will try to get a snapshot of this shortly.
> >>>>>>
> >>>>>> If this observation is correct, what's the best way to find out where
> >>>>>> it's
> >>>>>> spinning? Profiling? Debug logging? Can you get profiling data from a
> >>>>>> JVM that doesn't exit?
> >>>>>
> >>>>> Once I know where it is, I can look at the code and then we'll go from
> >>>>> there.
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
>
>
>
From skenny at uchicago.edu Mon Jul 13 17:41:50 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Mon, 13 Jul 2009 17:41:50 -0500 (CDT)
Subject: [Swift-devel] Coaster CPU-time consumption
issue
In-Reply-To: <4A5BB276.1070902@mcs.anl.gov>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost>
<50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com>
<4A5BA007.2050101@mcs.anl.gov>
<50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com>
<4A5BB276.1070902@mcs.anl.gov>
Message-ID: <20090713174150.CAD74976@m4500-02.uchicago.edu>
cool, i'll give this a shot now too.
it's possible the other err i mentioned to you mike, was
actually related to the stdout redirection. i wanted to test
more, but trying not to wreak havoc on the headnode :P anyway,
if this works, i can do more testing and will post if i'm
still getting the error.
~sk
---- Original message ----
>Date: Mon, 13 Jul 2009 17:17:26 -0500
>From: Michael Wilde
>Subject: Re: [Swift-devel] Coaster CPU-time consumption issue
>To: Allan Espinosa , Sarah Kenny
>Cc: swift-devel
>
>Nice! Now lets beat it up and see how well it works.
>
>Sarah: Allan did not encounter the error messages you
mentioned to me.
>
>I suggest you do this:
>
>- post to the devel list the messages you got
>
>- test this patch to see if it clears up the problem
>
>Mike
>
>
>On 7/13/09 5:04 PM, Allan Espinosa wrote:
>> hi,
>>
>> here is a patch which solves the cpu usage on the bootstrap
coaster
>> service:
http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch
>>
>> suggested svn log entry:
>> Added locks via wait() and notify() to prevent busy
waiting/
>> active polling in the block task queue.
>>
>>
>> Test: 2000 touch jobs using 066-many.swift via local:local:
>> before: http://www.ci.uchicago.edu/~aespinosa/swift/run06
>> after: http://www.ci.uchicago.edu/~aespinosa/swift/run07
>>
>> CPU usage drops from 100% to 0% with a few 25-40 % spikes!
>>
>> -Allan
>>
>>
>> 2009/7/13 Michael Wilde :
>>> Hi Allan,
>>>
>>> I think the methods you want for synchronization are part
of class Object.
>>>
>>> They are documented in the chapter Threads and Locks of
The Java Language
>>> Specification:
>>>
>>>
http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8
>>>
>>> queue.wait() should be called if the queue is empty.
>>>
>>> queue.notify() or .notifyAll() should be called when
something is added to
>>> the queue. I think notify() should work.
>>>
>>> .wait() will, I think, take a timeout, but I suspect you don't need
that.
>>>
>>> Both should be called within the synchronized(queue)
constructs that are
>>> already in the code.
>>>
>>> Should be fun to fix this!
>>>
>>> - Mike
>>>
>>>
>>>
>>>
>>>
>>> On 7/13/09 2:12 PM, Allan Espinosa wrote:
>>>> 97% is an average as can be seen in run06. swift version
is r3005 and
>>>> cogkit r2410. this is a vanilla build of swift.
>>>>
>>>> 2009/7/13 Mihael Hategan :
>>>>> A while ago I committed a patch to run the service
process with a lower
>>>>> priority. Is that in use?
>>>>>
>>>>> Also, is logging reduced or is it the default?
>>>>>
>>>>> Is the 97% CPU usage a spike, or does it stay there on
average?
>>>>>
>>>>> Can I take a look at the coaster logs from skenny's run
on ranger?
>>>>>
>>>>> I'd also like to point out in as little offensive mode
as I can, that
>>>>> I'm working 100% on I2U2 and my lack of getting more
than lightly
>>>>> involved in this is a consequence of that.
>>>>>
>>>>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
>>>>>> I ran 2000 "sleep 60" jobs on teraport and monitored
tp-osg. From
>>>>>> here process 22395 is the child of the main java process
>>>>>> (bootstrap.jar) and is loading the CPU.
>>>>>>
>>>>>> I have coasters.log, worker-*log, swift logs, gram logs in
>>>>>> ~aespinosa/workflows/activelog/run06. This refers to a
different run.
>>>>>> PID 15206 is the child java process of bootstrap.jar
in here.
>>>>>>
>>>>>> top snapshot:
>>>>>> top - 13:49:03 up 55 days, 1:45, 1 user, load
average: 1.18, 0.80,
>>>>>> 0.55
>>>>>> Tasks: 121 total, 1 running, 120 sleeping, 0
stopped, 0 zombie
>>>>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa,
0.0%hi, 0.0%si,
>>>>>> 0.0%st
>>>>>> Mem: 4058916k total, 3889864k used, 169052k free,
239688k buffers
>>>>>> Swap: 4192956k total, 96k used, 4192860k free,
2504812k cached
>>>>>>
>>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM
TIME+ COMMAND
>>>>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3
4:29.22 java
>>>>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0
0:00.50 top
>>>>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1
0:00.10
>>>>>> globus-job-mana
>>>>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0
0:00.06 sshd
>>>>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1
0:00.18 bash
>>>>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0
0:00.00 bash
>>>>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0
0:00.00 bash
>>>>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5
0:00.20 java
>>>>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1
0:00.09
>>>>>> globus-job-man
>>>>>>
>>>>>> ps snapshot:
>>>>>>
>>>>>> 22328 ? S 0:00 \_ /bin/bash
>>>>>> 22364 ? Sl 0:00 \_
>>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>>>>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
-DGLOBUS_TCP_PORT_RANGE=
>>>>>>
>>>>>>
-DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>>>> -DX509_CERT_DIR=
-DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
>>>>>> /tmp/bootstrap.w22332
http://communicado.ci.uchicago.edu:46520
>>>>>> https://128.135.125.17:46519 11505253269
>>>>>> 22395 ? SNl 6:29 \_
>>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
>>>>>>
>>>>>>
-DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>>>>>> -Djava.security.egd=file:///dev/urandom -cp
>>>>>>
>>>>>>
>>>>>> [long coaster-jar classpath snipped]
>>>>>>
>>>>>>
>>>>>> 2009/7/13 Mihael Hategan :
>>>>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
>>>>>>>>>> At the time we did not have a chance to gather
detailed evidence,
>>>>>>>>>> but I
>>>>>>>>>> was surprised by two things:
>>>>>>>>>>
>>>>>>>>>> - that there were two Java processes and that one
was so big. (Or,
>>>>>>>>>> most
>>>>>>>>>> likely, was the active process just a child thread
of the main
>>>>>>>>>> process?)
>>>>>>>>> One java process is the bootstrap process (it
downloads the coaster
>>>>>>>>> jars, sets up the environment and runs the coaster
service). It has
>>>>>>>>> always been like this. Did you happen to capture the
output of ps to
>>>>>>>>> a
>>>>>>>>> file? That would be useful, because from what you
are suggesting, it
>>>>>>>>> appears that the bootstrap process is eating 100%
CPU. That process
>>>>>>>>> should only be sleeping after the service is started.
>>>>>>>> I *thought* I captured the output of "top -u
sarahs'id -b -d" but I
>>>>>>>> can't
>>>>>>>> locate it.
>>>>>>>>
>>>>>>>> As best as I can recall it showed the larger
memory-footprint process
>>>>>>>> to
>>>>>>>> be relatively idle, and the smaller footprint process
(about 275MB) to
>>>>>>>> be burning 100% of a CPU.
>>>>>>> Normally, the smaller footprint process should be the
bootstrap. But
>>>>>>> that's why I would like the ps output, because it
sounds odd.
>>>>>>>
>>>>>>>> Allan will try to get a snapshot of this shortly.
>>>>>>>>
>>>>>>>> If this observation is correct, what's the best way to
find out where
>>>>>>>> it's
>>>>>>>> spinning? Profiling? Debug logging? Can you get
profiling data from a
>>>>>>>> JVM that doesn't exit?
>>>>>>> Once I know where it is, I can look at the code and
then we'll go from
>>>>>>> there.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
From wilde at mcs.anl.gov Mon Jul 13 18:00:33 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 13 Jul 2009 18:00:33 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <20090713174150.CAD74976@m4500-02.uchicago.edu>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost>
<50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com>
<4A5BA007.2050101@mcs.anl.gov>
<50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com>
<4A5BB276.1070902@mcs.anl.gov>
<20090713174150.CAD74976@m4500-02.uchicago.edu>
Message-ID: <4A5BBC91.9060005@mcs.anl.gov>
OK. Best to test with Mihael's cog r2429.
I hope that this ends the latest head-node havoc :)
Please post either way, so we know if the other problem remains or not.
Thanks,
- Mike
On 7/13/09 5:41 PM, skenny at uchicago.edu wrote:
> cool, i'll give this a shot now too.
>
> it's possible the other err i mentioned to you mike, was
> actually related to the stdout redirection. i wanted to test
> more, but trying not to wreak havoc on the headnode :P anyway,
> if this works, i can do more testing and will post if i'm
> still getting the error.
>
> ~sk
>
> ---- Original message ----
>> Date: Mon, 13 Jul 2009 17:17:26 -0500
>> From: Michael Wilde
>> Subject: Re: [Swift-devel] Coaster CPU-time consumption issue
>> To: Allan Espinosa , Sarah Kenny
>
>> Cc: swift-devel
>>
>> Nice! Now let's beat it up and see how well it works.
>>
>> Sarah: Allan did not encounter the error messages you
> mentioned to me.
>> I suggest you do this:
>>
>> - post to the devel list the messages you got
>>
>> - test this patch to see if it clears up the problem
>>
>> Mike
>>
>>
>> On 7/13/09 5:04 PM, Allan Espinosa wrote:
>>> hi,
>>>
>>> here is a patch which solves the cpu usage on the bootstrap
> coaster
>>> service:
> http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch
>>> suggested svn log entry:
>>> Added locks via wait() and notify() to prevent busy
> waiting/
>>> active polling in the block task queue.
>>>
>>>
>>> Test 2000 touch job using 066-many.swift via local:local :
>>> before: http://www.ci.uchicago.edu/~aespinosa/swift/run06
>>> after: http://www.ci.uchicago.edu/~aespinosa/swift/run07
>>>
>>> CPU usage drops from 100% to 0% with a few 25-40 % spikes!
>>>
>>> -Allan
>>>
>>>
>>> 2009/7/13 Michael Wilde :
>>>> Hi Allan,
>>>>
>>>> I think the methods you want for synchronization are part
> of class Object.
>>>> They are documented in the chapter Threads and Locks of
> The Java Language
>>>> Specification:
>>>>
>>>>
> http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8
>>>> queue.wait() should be called if the queue is empty.
>>>>
>>>> queue.notify() or .notifyAll() should be called when
> something is added to
>>>> the queue. I think notify() should work.
>>>>
>>>> .wait() will, I think, take a timeout, but I suspect you don't need
> that.
>>>> Both should be called within the synchronized(queue)
> constructs that are
>>>> already in the code.
>>>>
>>>> Should be fun to fix this!
>>>>
>>>> - Mike
>>>>
>>>>
>>>>
>>>>
>>>>
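[Editor's note: the synchronized/wait/notify pattern described above can be sketched as follows. This is a minimal illustration of the technique, not the actual coaster code; the `TaskQueue` class name and methods are hypothetical.]

```java
import java.util.LinkedList;
import java.util.Queue;

// Minimal sketch of a blocking task queue: consumers wait() on an empty
// queue instead of busy-polling, and producers notify() when work arrives.
public class TaskQueue<T> {
    private final Queue<T> queue = new LinkedList<T>();

    public void put(T task) {
        synchronized (queue) {
            queue.add(task);
            queue.notify(); // wake one waiting consumer
        }
    }

    public T take() throws InterruptedException {
        synchronized (queue) {
            // Loop (not "if") to guard against spurious wakeups.
            while (queue.isEmpty()) {
                queue.wait(); // releases the lock until notified
            }
            return queue.remove();
        }
    }
}
```

Both wait() and notify() are called while holding the queue's monitor, i.e. inside the synchronized(queue) blocks already present in the code.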
>>>> On 7/13/09 2:12 PM, Allan Espinosa wrote:
>>>>> 97% is an average as can be seen in run06. swift version
> is r3005 and
>>>>> cogkit r2410. this is a vanilla build of swift.
>>>>>
>>>>> 2009/7/13 Mihael Hategan :
>>>>>> A while ago I committed a patch to run the service
> process with a lower
>>>>>> priority. Is that in use?
>>>>>>
>>>>>> Also, is logging reduced or is it the default?
>>>>>>
>>>>>> Is the 97% CPU usage a spike, or does it stay there on
> average?
>>>>>> Can I take a look at the coaster logs from skenny's run
> on ranger?
>>>>>> I'd also like to point out in as little offensive mode
> as I can, that
>>>>>> I'm working 100% on I2U2 and my lack of getting more
> than lightly
>>>>>> involved in this is a consequence of that.
>>>>>>
>>>>>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
>>>>>>> I ran 2000 "sleep 60" jobs on teraport and monitored
> tp-osg. From
>>>>>>> here process 22395 is the child of the main java process
>>>>>>> (bootstrap.jar) and is loading the CPU.
>>>>>>>
>>>>>>> I have coasters.log, worker-*log, swift logs, gram logs in
>>>>>>> ~aespinosa/workflows/activelog/run06. This refers to a
> different run.
>>>>>>> PID 15206 is the child java process of bootstrap.jar
> in here.
>>>>>>> top snapshot:
>>>>>>> top - 13:49:03 up 55 days, 1:45, 1 user, load
> average: 1.18, 0.80,
>>>>>>> 0.55
>>>>>>> Tasks: 121 total, 1 running, 120 sleeping, 0
> stopped, 0 zombie
>>>>>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa,
> 0.0%hi, 0.0%si,
>>>>>>> 0.0%st
>>>>>>> Mem: 4058916k total, 3889864k used, 169052k free,
> 239688k buffers
>>>>>>> Swap: 4192956k total, 96k used, 4192860k free,
> 2504812k cached
>>>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM
> TIME+ COMMAND
>>>>>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3
> 4:29.22 java
>>>>>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0
> 0:00.50 top
>>>>>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1
> 0:00.10
>>>>>>> globus-job-mana
>>>>>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0
> 0:00.06 sshd
>>>>>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1
> 0:00.18 bash
>>>>>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0
> 0:00.00 bash
>>>>>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0
> 0:00.00 bash
>>>>>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5
> 0:00.20 java
>>>>>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1
> 0:00.09
>>>>>>> globus-job-man
>>>>>>>
>>>>>>> ps snapshot:
>>>>>>>
>>>>>>> 22328 ? S 0:00 \_ /bin/bash
>>>>>>> 22364 ? Sl 0:00 \_
>>>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>>>>>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
> -DGLOBUS_TCP_PORT_RANGE=
>>>>>>>
> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>>>>> -DX509_CERT_DIR=
> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
>>>>>>> /tmp/bootstrap.w22332
> http://communicado.ci.uchicago.edu:46520
>>>>>>> https://128.135.125.17:46519 11505253269
>>>>>>> 22395 ? SNl 6:29 \_
>>>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
>>>>>>>
>>>>>>>
> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>>>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>>>>>>> -Djava.security.egd=file:///dev/urandom -cp
>>>>>>>
>>>>>>>
> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcde
>> c9
> 46b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.
>> jar
> :/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvou
>> s_s
> ervice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc
>>>>>>>
>>>>>>> 2009/7/13 Mihael Hategan :
>>>>>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
>>>>>>>>>>> At the time we did not have a chance to gather
> detailed evidence,
>>>>>>>>>>> but I
>>>>>>>>>>> was surprised by two things:
>>>>>>>>>>>
>>>>>>>>>>> - that there were two Java processes and that one
> was so big. (Or,
>>>>>>>>>>> most
>>>>>>>>>>> likely, was the active process just a child thread
> of the main
>>>>>>>>>>> process?)
>>>>>>>>>> One java process is the bootstrap process (it
> downloads the coaster
>>>>>>>>>> jars, sets up the environment and runs the coaster
> service). It has
>>>>>>>>>> always been like this. Did you happen to capture the
> output of ps to
>>>>>>>>>> a
>>>>>>>>>> file? That would be useful, because from what you
> are suggesting, it
>>>>>>>>>> appears that the bootstrap process is eating 100%
> CPU. That process
>>>>>>>>>> should only be sleeping after the service is started.
>>>>>>>>> I *thought* I captured the output of "top -u
> sarahs'id -b -d" but I
>>>>>>>>> can't
>>>>>>>>> locate it.
>>>>>>>>>
>>>>>>>>> As best as I can recall it showed the larger
> memory-footprint process
>>>>>>>>> to
>>>>>>>>> be relatively idle, and the smaller footprint process
> (about 275MB) to
>>>>>>>>> be burning 100% of a CPU.
>>>>>>>> Normally, the smaller footprint process should be the
> bootstrap. But
>>>>>>>> that's why I would like the ps output, because it
> sounds odd.
>>>>>>>>> Allan will try to get a snapshot of this shortly.
>>>>>>>>>
>>>>>>>>> If this observation is correct, what's the best way to
> find out where
>>>>>>>>> it's
>>>>>>>>> spinning? Profiling? Debug logging? Can you get
> profiling data from a
>>>>>>>>> JVM that doesn't exit?
>>>>>>>> Once I know where it is, I can look at the code and
> then we'll go from
>>>>>>>> there.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>
>>>
From tiberius at ci.uchicago.edu Mon Jul 13 18:21:22 2009
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Mon, 13 Jul 2009 18:21:22 -0500
Subject: [Swift-devel] Functionality request: best effort execution
Message-ID:
Hi Swift team
I am curious if there is a way of coding up (or having in the near
future) the following functionality:
(file output) applicationWrapper(file input){
appOutput = runAtomicApplication(input);
dummyOutput = runTimer ();
if (Atomic Application Finished First){
output = appOutput;
} else {
output = dummyOutput;
}
}
I am not sure how to tell swift to stop waiting for the second task,
as soon as the first one has completed successfully.
Thank you
Tibi
--
Tiberiu (Tibi) Stef-Praun, PhD
Computational Sciences Researcher
Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/
From hategan at mcs.anl.gov Mon Jul 13 18:59:24 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 18:59:24 -0500
Subject: [Swift-devel] Functionality request: best effort execution
In-Reply-To:
References:
Message-ID: <1247529564.27051.3.camel@localhost>
That somewhat crosses over the fence of the
time-agnostic/sequence-independent nature that Swift operates in.
Can you implement this as part of your application (i.e. a wrapper
script)?
On Mon, 2009-07-13 at 18:21 -0500, Tiberiu Stef-Praun wrote:
> Hi Swift team
>
> I am curious if there is a way of coding up (or having in the near
> future) the following functionality:
>
> (file output) applicationWrapper(file input){
> appOutput = runAtomicApplication(input);
> dummyOutput = runTimer ();
>
> if (Atomic Application Finished First){
> output = appOutput;
> } else {
> output = dummyOutput;
> }
> }
>
>
> I am not sure how to tell swift to stop waiting for the second task,
> as soon as the first one has completed successfully.
>
> Thank you
> Tibi
>
From wilde at mcs.anl.gov Mon Jul 13 19:05:38 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 13 Jul 2009 19:05:38 -0500
Subject: [Swift-devel] Functionality request: best effort execution
In-Reply-To: <1247529564.27051.3.camel@localhost>
References:
<1247529564.27051.3.camel@localhost>
Message-ID: <4A5BCBD2.2070600@mcs.anl.gov>
On 7/13/09 6:59 PM, Mihael Hategan wrote:
> That somewhat crosses over the fence of
> time-agnostic/sequence-independent nature that swift is in.
>
> Can you implement this as part of your application (i.e. a wrapper
> script)?
I agree - I think the logic below could be done in a shell script fairly
simply, Tibi.
- Mike
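
[Editor's note: as a rough illustration of that wrapper idea — run the application with a deadline and fall back to a dummy result — the logic might look like the following. This is a hedged sketch in Java rather than a shell script; `runWithFallback` and its parameters are hypothetical names, not Swift API.]

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the "app vs. timer" fallback: run the application with a
// deadline, and substitute a dummy result if it does not finish in time.
public class BestEffort {
    public static String runWithFallback(Callable<String> app,
                                         long timeoutSeconds,
                                         String dummyOutput) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<String> f = exec.submit(app);
        try {
            return f.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);     // stop waiting for the runaway task
            return dummyOutput;
        } catch (Exception e) {
            return dummyOutput; // treat failures as a dummy result too
        } finally {
            exec.shutdownNow();
        }
    }
}
```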
>
> On Mon, 2009-07-13 at 18:21 -0500, Tiberiu Stef-Praun wrote:
>> Hi Swift team
>>
>> I am curious if there is a way of coding up (or having in the near
>> future) the following functionality:
>>
>> (file output) applicationWrapper(file input){
>> appOutput = runAtomicApplication(input);
>> dummyOutput = runTimer ();
>>
>> if (Atomic Application Finished First){
>> output = appOutput;
>> } else {
>> output = dummyOutput;
>> }
>> }
>>
>>
>> I am not sure how to tell swift to stop waiting for the second task,
>> as soon as the first one has completed successfully.
>>
>> Thank you
>> Tibi
>>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From tiberius at ci.uchicago.edu Mon Jul 13 20:22:57 2009
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Mon, 13 Jul 2009 20:22:57 -0500
Subject: [Swift-devel] Functionality request: best effort execution
In-Reply-To: <4A5BCBD2.2070600@mcs.anl.gov>
References:
<1247529564.27051.3.camel@localhost> <4A5BCBD2.2070600@mcs.anl.gov>
Message-ID:
I am trying to control for runaway tasks, not just to simulate them.
The scenario is for tasks which are waiting in the queue, in which
case the wrapper script will not be able to implement the timeout
functionality (because the tasks are not executed yet).
For this reason, I wanted Swift to be aware of time-limited jobs, and
give up on them without an error message (by defaulting to a "dummy"
output).
I am wondering if I can use globus::maxwalltime as a timeout mechanism?
My current solution is to have a task run locally and the other one
remotely, and to use the local tasks' timeout as a barrier to
generating the dummy output or to validating the remote result as the
proper output.
I know I am pushing the limits here, that's what I pretty much do all
the time with Swift.
Tibi
On Mon, Jul 13, 2009 at 7:05 PM, Michael Wilde wrote:
>
>
> On 7/13/09 6:59 PM, Mihael Hategan wrote:
>>
>> That somewhat crosses over the fence of
>> time-agnostic/sequence-independent nature that swift is in.
>>
>> Can you implement this as part of your application (i.e. a wrapper
>> script)?
>
> I agree - I think the logic below could be done in a shell script fairly
> simply, Tibi.
>
> - Mike
>
>>
>> On Mon, 2009-07-13 at 18:21 -0500, Tiberiu Stef-Praun wrote:
>>>
>>> Hi Swift team
>>>
>>> I am curious if there is a way of coding up (or having in the near
>>> future) the following functionality:
>>>
>>> (file output) applicationWrapper(file input){
>>>   appOutput = runAtomicApplication(input);
>>>   dummyOutput = runTimer ();
>>>
>>>   if (Atomic Application Finished First){
>>>     output = appOutput;
>>>   } else {
>>>     output = dummyOutput;
>>>   }
>>> }
>>>
>>>
>>> I am not sure how to tell swift to stop waiting for the second task,
>>> as soon as the first one has completed successfully.
>>>
>>> Thank you
>>> Tibi
>>>
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
--
Tiberiu (Tibi) Stef-Praun, PhD
Computational Sciences Researcher
Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/
From hategan at mcs.anl.gov Mon Jul 13 20:52:22 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 20:52:22 -0500
Subject: [Swift-devel] Functionality request: best effort execution
In-Reply-To:
References:
<1247529564.27051.3.camel@localhost> <4A5BCBD2.2070600@mcs.anl.gov>
Message-ID: <1247536342.28535.20.camel@localhost>
On Mon, 2009-07-13 at 20:22 -0500, Tiberiu Stef-Praun wrote:
> I am trying to control for runaway tasks, not just to simulate them.
> The scenario is for tasks which are waiting in the queue, in which
> case the wrapper script will not be able to implement the timeout
> functionality (because the tasks are not executed yet).
> For this reason, I wanted Swift to be aware of time-limited jobs, and
> give up on them without an error message (by defaulting to a "dummy"
> output).
It would make your workflow nondeterministic depending on the resources
you run on, possibly even giving you only dummy results without so
much as a single complaint. Are you sure this is what you want?
In a sense, with swift, I think we're trying to eliminate this kind of
nondeterministic behavior that is common in strict language concurrency,
but that also means we need to restrict certain things.
I can see applications for this, in that some problems are
time-sensitive (some things may only be useful if done before a certain
deadline).
So I'm unsure about the following:
- whether this is a language issue, or something for the runtime
- whether swift should support this kind of process control
- what the consequences of this would be to the system in general
(including but not limited to the possibility of implementing a "virtual
data" thing with it and the ability to have reproducible experiments).
- whether there is a middle ground, such as isolating side-effects like
this (Ben would mention haskell and monads about here).
>
> I am wondering if I can use globus::maxwalltime as a timeout mechanism ?
maxwalltime applies to the actual job (not queue time), so it's worse
than a wrapper script: with a wrapper script you
can gracefully supply a dummy result, whereas violating maxwalltime
results in an error.
> My current solution is to have a task run locally and the other one
> remotely, and to use the local tasks' timeout as a barrier to
> generating the dummy output or to validating the remote result as the
> proper output.
>
> I know I am pushing the limits here, that's what I pretty much do all
> the time with Swift.
I don't think this is a discussion about mechanisms, since for that
there already is a solution in karajan called "race" (a discriminator in
"workflow" terms) which (theoretically) takes care of the cleanup
including canceling the branches that lost and any jobs that they might
have launched.
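
[Editor's note: the "race"/discriminator semantics Mihael mentions — first branch to finish wins, the losers get cancelled — can be sketched in plain Java. This is an illustration of the concept only, not the Karajan implementation; the `Race` class is hypothetical.]

```java
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of discriminator ("race") semantics: submit both branches and
// take whichever completes successfully first; invokeAny cancels the
// tasks that have not yet completed when it returns.
public class Race {
    public static String race(Callable<String> a, Callable<String> b)
            throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(2);
        try {
            return exec.invokeAny(Arrays.asList(a, b));
        } finally {
            exec.shutdownNow(); // interrupt any losing branch still running
        }
    }
}
```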
From skenny at uchicago.edu Mon Jul 13 21:54:11 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Mon, 13 Jul 2009 21:54:11 -0500 (CDT)
Subject: [Swift-devel] Coasters and std's on ranger
Message-ID: <20090713215411.CAD90960@m4500-02.uchicago.edu>
so, here is the swift error i currently get running a 50-job
workflow with the latest code on ranger:
Execution failed:
Exception in RInvoke:
Arguments: [scripts/4reg_dummy.R,
matrices/4_reg/network1/gestspeech.cov, 31, 0.5, speech]
Host: RANGER
Directory:
4reg_speech-20090713-2127-tbl7ou0e/jobs/f/RInvoke-f57xpmdj
stderr.txt:
stdout.txt:
----
Caused by:
Block task failed:
org.globus.gram.GramException: The job manager could not stage
out a file
at
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
at org.globus.gram.GramJob.setStatus(GramJob.java:184)
at
org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
at java.lang.Thread.run(Thread.java:619)
Cleaning up...
Shutting down service at https://129.114.50.163:36721
Got channel MetaChannel: 24980848 -> GSSSChannel-null(1)
- Done
gram log shows this:
7/13 21:37:58 JM: sending callback of status 4 (failure code
155) to https://128.135.125.211:50003/1247538475621.
7/13 21:37:58 JMI: testing job manager scripts for type fork
exist and permissions are ok.
this is the same error i was getting on ranger running without
coasters prior to commenting out the redirection of stdout and
stderr (which corrected the error for provider-gt2). is there
a redirection of these std's going on in provider-coaster that
can be corrected somehow?
~sk
p.s. let me know if anyone would like the swift log for this.
---- Original message ----
>Date: Mon, 13 Jul 2009 18:00:33 -0500
>From: Michael Wilde
>Subject: Re: [Swift-devel] Coaster CPU-time consumption issue
>To: skenny at uchicago.edu
>Cc: Allan Espinosa , swift-devel
>
>OK. Best to test with Mihael's cog r2429.
>I hope that this ends the latest head-node havoc :)
>
>Please post either way, so we know if the other problem
remains or not.
>
>Thanks,
>
>- Mike
>
>
>On 7/13/09 5:41 PM, skenny at uchicago.edu wrote:
>> cool, i'll give this a shot now too.
>>
>> it's possible the other err i mentioned to you mike, was
>> actually related to the stdout redirection. i wanted to test
>> more, but trying not to wreak havoc on the headnode :P anyway,
>> if this works, i can do more testing and will post if i'm
>> still getting the error.
>>
>> ~sk
>>
>> ---- Original message ----
>>> Date: Mon, 13 Jul 2009 17:17:26 -0500
>>> From: Michael Wilde
>>> Subject: Re: [Swift-devel] Coaster CPU-time consumption
issue
>>> To: Allan Espinosa , Sarah Kenny
>>
>>> Cc: swift-devel
>>>
>>> Nice! Now let's beat it up and see how well it works.
>>>
>>> Sarah: Allan did not encounter the error messages you
>> mentioned to me.
>>> I suggest you do this:
>>>
>>> - post to the devel list the messages you got
>>>
>>> - test this patch to see if it clears up the problem
>>>
>>> Mike
>>>
>>>
>>> On 7/13/09 5:04 PM, Allan Espinosa wrote:
>>>> hi,
>>>>
>>>> here is a patch which solves the cpu usage on the bootstrap
>> coaster
>>>> service:
>>
http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch
>>>> suggested svn log entry:
>>>> Added locks via wait() and notify() to prevent busy
>> waiting/
>>>> active polling in the block task queue.
>>>>
>>>>
>>>> Test 2000 touch job using 066-many.swift via local:local :
>>>> before: http://www.ci.uchicago.edu/~aespinosa/swift/run06
>>>> after: http://www.ci.uchicago.edu/~aespinosa/swift/run07
>>>>
>>>> CPU usage drops from 100% to 0% with a few 25-40 % spikes!
>>>>
>>>> -Allan
>>>>
>>>>
>>>> 2009/7/13 Michael Wilde :
>>>>> Hi Allan,
>>>>>
>>>>> I think the methods you want for synchronization are part
>> of class Object.
>>>>> They are documented in the chapter Threads and Locks of
>> The Java Language
>>>>> Specification:
>>>>>
>>>>>
>>
http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8
>>>>> queue.wait() should be called if the queue is empty.
>>>>>
>>>>> queue.notify() or .notifyAll() should be called when
>> something is added to
>>>>> the queue. I think notify() should work.
>>>>>
>>>>> .wait() will, I think, take a timeout, but I suspect you don't need
>> that.
>>>>> Both should be called within the synchronized(queue)
>> constructs that are
>>>>> already in the code.
>>>>>
>>>>> Should be fun to fix this!
>>>>>
>>>>> - Mike
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 7/13/09 2:12 PM, Allan Espinosa wrote:
>>>>>> 97% is an average as can be seen in run06. swift version
>> is r3005 and
>>>>>> cogkit r2410. this is a vanilla build of swift.
>>>>>>
>>>>>> 2009/7/13 Mihael Hategan :
>>>>>>> A while ago I committed a patch to run the service
>> process with a lower
>>>>>>> priority. Is that in use?
>>>>>>>
>>>>>>> Also, is logging reduced or is it the default?
>>>>>>>
>>>>>>> Is the 97% CPU usage a spike, or does it stay there on
>> average?
>>>>>>> Can I take a look at the coaster logs from skenny's run
>> on ranger?
>>>>>>> I'd also like to point out in as little offensive mode
>> as I can, that
>>>>>>> I'm working 100% on I2U2 and my lack of getting more
>> than lightly
>>>>>>> involved in this is a consequence of that.
>>>>>>>
>>>>>>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
>>>>>>>> I ran 2000 "sleep 60" jobs on teraport and monitored
>> tp-osg. From
>>>>>>>> here process 22395 is the child of the main java process
>>>>>>>> (bootstrap.jar) and is loading the CPU.
>>>>>>>>
>>>>>>>> I have coasters.log, worker-*log, swift logs, gram
logs in
>>>>>>>> ~aespinosa/workflows/activelog/run06. This refers to a
>> different run.
>>>>>>>> PID 15206 is the child java process of bootstrap.jar
>> in here.
>>>>>>>> top snapshot:
>>>>>>>> top - 13:49:03 up 55 days, 1:45, 1 user, load
>> average: 1.18, 0.80,
>>>>>>>> 0.55
>>>>>>>> Tasks: 121 total, 1 running, 120 sleeping, 0
>> stopped, 0 zombie
>>>>>>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa,
>> 0.0%hi, 0.0%si,
>>>>>>>> 0.0%st
>>>>>>>> Mem: 4058916k total, 3889864k used, 169052k free,
>> 239688k buffers
>>>>>>>> Swap: 4192956k total, 96k used, 4192860k free,
>> 2504812k cached
>>>>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM
>> TIME+ COMMAND
>>>>>>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3
>> 4:29.22 java
>>>>>>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0
>> 0:00.50 top
>>>>>>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1
>> 0:00.10
>>>>>>>> globus-job-mana
>>>>>>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0
>> 0:00.06 sshd
>>>>>>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1
>> 0:00.18 bash
>>>>>>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0
>> 0:00.00 bash
>>>>>>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0
>> 0:00.00 bash
>>>>>>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5
>> 0:00.20 java
>>>>>>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1
>> 0:00.09
>>>>>>>> globus-job-man
>>>>>>>>
>>>>>>>> ps snapshot:
>>>>>>>>
>>>>>>>> 22328 ? S 0:00 \_ /bin/bash
>>>>>>>> 22364 ? Sl 0:00 \_
>>>>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>>>>>>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>> -DGLOBUS_TCP_PORT_RANGE=
>>>>>>>>
>>
-DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>>>>>> -DX509_CERT_DIR=
>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
>>>>>>>> /tmp/bootstrap.w22332
>> http://communicado.ci.uchicago.edu:46520
>>>>>>>> https://128.135.125.17:46519 11505253269
>>>>>>>> 22395 ? SNl 6:29 \_
>>>>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
>>>>>>>>
>>>>>>>>
>>
-DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>>>>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>>>>>>>> -Djava.security.egd=file:///dev/urandom -cp
>>>>>>>>
>>>>>>>>
>>
/home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec946b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar:/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_service-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc
>>>>>>>>
>>>>>>>> 2009/7/13 Mihael Hategan :
>>>>>>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
>>>>>>>>>>>> At the time we did not have a chance to gather
>> detailed evidence,
>>>>>>>>>>>> but I
>>>>>>>>>>>> was surprised by two things:
>>>>>>>>>>>>
>>>>>>>>>>>> - that there were two Java processes and that one
>> was so big. (Are
>>>>>>>>>>>> most
>>>>>>>>>>>> likely the active process was just a child thread
>> of the main
>>>>>>>>>>>> process?)
>>>>>>>>>>> One java process is the bootstrap process (it
>> downloads the coaster
>>>>>>>>>>> jars, sets up the environment and runs the coaster
>> service). It has
>>>>>>>>>>> always been like this. Did you happen to capture the
>> output of ps to
>>>>>>>>>>> a
>>>>>>>>>>> file? That would be useful, because from what you
>> are suggesting, it
>>>>>>>>>>> appears that the bootstrap process is eating 100%
>> CPU. That process
>>>>>>>>>>> should only be sleeping after the service is started.
>>>>>>>>>> I *thought* I captured the output of "top -u
>> sarahs'id -b -d" but I
>>>>>>>>>> can't
>>>>>>>>>> locate it.
>>>>>>>>>>
>>>>>>>>>> As best as I can recall it showed the larger
>> memory-footprint process
>>>>>>>>>> to
>>>>>>>>>> be relatively idle, and the smaller footprint process
>> (about 275MB) to
>>>>>>>>>> be burning 100% of a CPU.
>>>>>>>>> Normally, the smaller footprint process should be the
>> bootstrap. But
>>>>>>>>> that's why I would like the ps output, because it
>> sounds odd.
>>>>>>>>>> Allan will try to get a snapshot of this shortly.
>>>>>>>>>>
>>>>>>>>>> If this observation is correct, what's the best way to
>> find out where
>>>>>>>>>> it's
>>>>>>>>>> spinning? Profiling? Debug logging? Can you get
>> profiling data from a
>>>>>>>>>> JVM that doesn't exit?
>>>>>>>>> Once I know where it is, I can look at the code and
>> then we'll go from
>>>>>>>>> there.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>
>>>>
From hategan at mcs.anl.gov Mon Jul 13 22:05:51 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 22:05:51 -0500
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <20090713215411.CAD90960@m4500-02.uchicago.edu>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
Message-ID: <1247540751.30172.5.camel@localhost>
On Mon, 2009-07-13 at 21:54 -0500, skenny at uchicago.edu wrote:
[...]
> gram log shows this:
>
> 7/13 21:37:58 JM: sending callback of status 4 (failure code
> 155) to https://128.135.125.211:50003/1247538475621.
> 7/13 21:37:58 JMI: testing job manager scripts for type fork
> exist and permissions are ok.
>
> this is the same error i was getting on ranger running without
> coasters prior to commenting out the redirection of stdout and
> stderr (which corrected the error for provider-gt2).
I am afraid then that this is an incurable problem with the current SGE
job manager.
I think there are two ways of dealing with this:
1. Report the problem to the folks who developed the SGE job manager and
hope it will get fixed and deployed on ranger
2. Write a local SGE provider [/me ducks while Ian throws various
objects in my general direction]
From hategan at mcs.anl.gov Mon Jul 13 22:18:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 13 Jul 2009 22:18:58 -0500
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <1247540751.30172.5.camel@localhost>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost>
Message-ID: <1247541538.30172.8.camel@localhost>
On Mon, 2009-07-13 at 22:05 -0500, Mihael Hategan wrote:
> On Mon, 2009-07-13 at 21:54 -0500, skenny at uchicago.edu wrote:
> [...]
> > gram log shows this:
> >
> > 7/13 21:37:58 JM: sending callback of status 4 (failure code
> > 155) to https://128.135.125.211:50003/1247538475621.
> > 7/13 21:37:58 JMI: testing job manager scripts for type fork
> > exist and permissions are ok.
> >
> > this is the same error i was getting on ranger running without
> > coasters prior to commenting out the redirection of stdout and
> > stderr (which corrected the error for provider-gt2).
>
> I am afraid then that this is an incurable problem with the current SGE
> job manager.
Or not...
I see that in the current coaster code the stdout of the block task is
always redirected.
Try cog r2430 and keep the commented lines commented in the gt2
provider.
From skenny at uchicago.edu Tue Jul 14 02:05:26 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Tue, 14 Jul 2009 02:05:26 -0500 (CDT)
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <1247541538.30172.8.camel@localhost>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost>
<1247541538.30172.8.camel@localhost>
Message-ID: <20090714020526.CAE04557@m4500-02.uchicago.edu>
>I see that in the current coaster code the stdout of the
block task is
>always redirected.
>
>Try cog r2430 and keep the commented lines commented in the gt2
>provider.
2009-07-14 01:14:30,525-0500 INFO unknown Swift svn
swift-r3005 cog-r2430 (cog modified locally)
Execution failed:
Exception in RInvoke:
Arguments: [scripts/4reg_dummy.R,
matrices/4_reg/network1/gestspeech.cov, 2, 0.5, speech]
Host: RANGER
Directory:
4reg_speech-20090714-0114-ad0vxv90/jobs/z/RInvoke-zlbzzmdj
stderr.txt:
stdout.txt:
----
Caused by:
Block task failed: 0714-140152-000000Block task ended
prematurely
Progress: Submitted:18 Failed:16 Finished successfully:16
Cleaning up...
gram log:
7/14 01:25:44 JM: sending callback of status 4 (failure code
155) to https://128.135.125.211:50003/1247552072425.
7/14 01:25:44 JMI: testing job manager scripts for type fork
exist and permissions are ok.
From benc at hawaga.org.uk Tue Jul 14 02:09:16 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 14 Jul 2009 07:09:16 +0000 (GMT)
Subject: [Swift-devel] Functionality request: best effort execution
In-Reply-To:
References:
Message-ID:
One way of putting in ambiguity here is something like the AMB(iguous)
operator, which looks very similar to Karajan's race behaviour.
a AMB b evaluates to either a or b, but it's not defined which, so the
runtime is free to pick.
That has no particular preference for a result, though in Tibi's use case
one of the results is probably preferred.
You could change the semantics so that it returns a unless a fails in
which case it evaluates and returns b, unless b fails in which case the
expression fails to evaluate.
Both of the above descriptions can be extended to more than two operands
in a natural way.
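[Editorial note] The fallback variant described above can be sketched as
follows; `amb` and `amb_n` are hypothetical names for illustration only,
not Swift or Karajan API:

```python
# Sketch of the fallback semantics: evaluate a; if a fails, evaluate
# and return b; if b also fails, the whole expression fails to evaluate.
def amb(a, b):
    try:
        return a()
    except Exception:
        return b()  # if b() raises too, the expression fails

# The natural extension to more than two operands: try each in order,
# returning the first success; fail only if every operand fails.
def amb_n(*thunks):
    err = None
    for t in thunks:
        try:
            return t()
        except Exception as e:
            err = e
    raise err
```

Under these semantics, amb(lambda: 1 // 0, lambda: 42) evaluates to 42.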
--
From bugzilla-daemon at mcs.anl.gov Tue Jul 14 06:29:00 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 14 Jul 2009 06:29:00 -0500 (CDT)
Subject: [Swift-devel] [Bug 210] job exceeding wallclock limit -- error is
not reported by swift
In-Reply-To:
References:
Message-ID: <20090714112900.40D302CB0F@wind.mcs.anl.gov>
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=210
--- Comment #1 from Ben Clifford 2009-07-14 06:29:00 ---
This bug is rather ambiguously described.
In non-bugzilla discussion it has been reported as:
> well, for some reason, when a job hits wallclock and is killed by the JM, swift just keeps saying "active"
This is not behaviour that I observe with Swift swift-r3006 cog-r2430 against
NCSA using the below SwiftScript and configuration - in that case, I see the
job fail three times in a row, and then the example SwiftScript fails as it
should.
Please clarify this bug.
s.swift:
$ cat s.swift
type messagefile;
app (messagefile t) greeting() {
sleep "999s" stdout=@filename(t);
}
messagefile outfile <"hello.txt">;
outfile = greeting();
tc.data:
$ cat tc.data
cat: tc.data: No such file or directory
benc at communicado:~/tmp-walltime/cog/modules/swift !1055
$ cat dist/swift-svn/etc/tc.data
#This is the transformation catalog.
#
#It comes pre-configured with a number of simple transformations with
#paths that are likely to work on a linux box. However, on some systems,
#the paths to these executables will be different (for example, sometimes
#some of these programs are found in /usr/bin rather than in /bin)
#
#NOTE WELL: fields in this file must be separated by tabs, not spaces; and
#there must be no trailing whitespace at the end of each line.
#
# sitename transformation path INSTALLED platform profiles
hg echo /bin/echo INSTALLED INTEL32::LINUX null
hg cat /bin/cat INSTALLED INTEL32::LINUX null
hg ls /bin/ls INSTALLED INTEL32::LINUX null
hg grep /bin/grep INSTALLED INTEL32::LINUX null
hg sort /bin/sort INSTALLED INTEL32::LINUX null
hg sleep /bin/sleep INSTALLED INTEL32::LINUX null
site definition:
/home/ac/benc
debug
1
the output:
Swift svn swift-r3006 cog-r2430
RunID: 20090714-0616-dgktv8b3
Progress:
Progress: Stage in:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Active:1
Progress: Active:1
Progress: Active:1
Progress: Active:1
Progress: Checking status:1
Progress: Stage in:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Active:1
Progress: Active:1
Progress: Active:1
Progress: Checking status:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Active:1
Progress: Active:1
Progress: Active:1
Progress: Checking status:1
Execution failed:
Exception in sleep:
Arguments: [999s]
Host: hg
Directory: s-20090714-0616-dgktv8b3/jobs/8/sleep-8h82cndj
stderr.txt:
stdout.txt:
----
Caused by:
No status file was found. Check the shared filesystem on hg
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching someone on the CC list of the bug.
From wilde at mcs.anl.gov Tue Jul 14 08:30:22 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 14 Jul 2009 08:30:22 -0500
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <20090714020526.CAE04557@m4500-02.uchicago.edu>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost>
<20090714020526.CAE04557@m4500-02.uchicago.edu>
Message-ID: <4A5C886E.8070300@mcs.anl.gov>
Sarah, if it's still broken on Thu, I will look at it then.
I assume it happens on single job runs as well.
Can you create a simple 1-job test directory on Ranger that I can copy
to reproduce the problem?
Ben, if you can solve this, this week, that would be great.
Else Allan and I will look at it; guidance welcome.
Thanks,
Mike
On 7/14/09 2:05 AM, skenny at uchicago.edu wrote:
>> I see that in the current coaster code the stdout of the
> block task is
>> always redirected.
>>
>> Try cog r2430 and keep the commented lines commented in the gt2
>> provider.
>
> 2009-07-14 01:14:30,525-0500 INFO unknown Swift svn
> swift-r3005 cog-r2430 (cog modified locally)
>
> Execution failed:
> Exception in RInvoke:
> Arguments: [scripts/4reg_dummy.R,
> matrices/4_reg/network1/gestspeech.cov, 2, 0.5, speech]
> Host: RANGER
> Directory:
> 4reg_speech-20090714-0114-ad0vxv90/jobs/z/RInvoke-zlbzzmdj
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Block task failed: 0714-140152-000000Block task ended
> prematurely
>
> Progress: Submitted:18 Failed:16 Finished successfully:16
> Cleaning up...
>
> gram log:
>
> 7/14 01:25:44 JM: sending callback of status 4 (failure code
> 155) to https://128.135.125.211:50003/1247552072425.
> 7/14 01:25:44 JMI: testing job manager scripts for type fork
> exist and permissions are ok.
>
>
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Tue Jul 14 09:59:01 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 14 Jul 2009 09:59:01 -0500
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <20090714020526.CAE04557@m4500-02.uchicago.edu>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost>
<1247541538.30172.8.camel@localhost>
<20090714020526.CAE04557@m4500-02.uchicago.edu>
Message-ID: <1247583541.1437.1.camel@localhost>
I see. What happens is that redirection hasn't been fixed in SGE, but
the commenting out of it in the gt2 provider did nothing because it was
enabled in the coaster provider.
There is one more thing to try, and that is to redirect to a remote
file, hoping it won't hit whatever problem it hits now.
On Tue, 2009-07-14 at 02:05 -0500, skenny at uchicago.edu wrote:
> >I see that in the current coaster code the stdout of the
> block task is
> >always redirected.
> >
> >Try cog r2430 and keep the commented lines commented in the gt2
> >provider.
>
> 2009-07-14 01:14:30,525-0500 INFO unknown Swift svn
> swift-r3005 cog-r2430 (cog modified locally)
>
> Execution failed:
> Exception in RInvoke:
> Arguments: [scripts/4reg_dummy.R,
> matrices/4_reg/network1/gestspeech.cov, 2, 0.5, speech]
> Host: RANGER
> Directory:
> 4reg_speech-20090714-0114-ad0vxv90/jobs/z/RInvoke-zlbzzmdj
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Block task failed: 0714-140152-000000Block task ended
> prematurely
>
> Progress: Submitted:18 Failed:16 Finished successfully:16
> Cleaning up...
>
> gram log:
>
> 7/14 01:25:44 JM: sending callback of status 4 (failure code
> 155) to https://128.135.125.211:50003/1247552072425.
> 7/14 01:25:44 JMI: testing job manager scripts for type fork
> exist and permissions are ok.
>
>
>
>
>
From skenny at uchicago.edu Tue Jul 14 10:11:40 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Tue, 14 Jul 2009 10:11:40 -0500 (CDT)
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <1247583541.1437.1.camel@localhost>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost>
<1247541538.30172.8.camel@localhost>
<20090714020526.CAE04557@m4500-02.uchicago.edu>
<1247583541.1437.1.camel@localhost>
Message-ID: <20090714101140.CAE36286@m4500-02.uchicago.edu>
>I see. What happens is that redirection hasn't been fixed in
SGE, but
>the commenting out of it in the gt2 provider did nothing
because it was
>enabled in the coaster provider.
right, i must've misunderstood, i had commented out
redirection for the gt2 provider so swift would work for
running w/o coasters, but i thought you were saying cog r2430
would also be redirecting for coasters...but
apparently you were trying a different change?
>There is one more thing to try, and that is to re-direct to a
remote
>file, hoping it won't hit whatever problem it hits now.
so, can you tell me where in the code i can redirect std's for
coasters? or, are you saying something else? :P
>On Tue, 2009-07-14 at 02:05 -0500, skenny at uchicago.edu wrote:
>> >I see that in the current coaster code the stdout of the
>> block task is
>> >always redirected.
>> >
>> >Try cog r2430 and keep the commented lines commented in
the gt2
>> >provider.
>>
>> 2009-07-14 01:14:30,525-0500 INFO unknown Swift svn
>> swift-r3005 cog-r2430 (cog modified locally)
>>
>> Execution failed:
>> Exception in RInvoke:
>> Arguments: [scripts/4reg_dummy.R,
>> matrices/4_reg/network1/gestspeech.cov, 2, 0.5, speech]
>> Host: RANGER
>> Directory:
>> 4reg_speech-20090714-0114-ad0vxv90/jobs/z/RInvoke-zlbzzmdj
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>> Block task failed: 0714-140152-000000Block task ended
>> prematurely
>>
>> Progress: Submitted:18 Failed:16 Finished successfully:16
>> Cleaning up...
>>
>> gram log:
>>
>> 7/14 01:25:44 JM: sending callback of status 4 (failure code
>> 155) to https://128.135.125.211:50003/1247552072425.
>> 7/14 01:25:44 JMI: testing job manager scripts for type fork
>> exist and permissions are ok.
>>
>>
>>
>>
>>
>
From hategan at mcs.anl.gov Tue Jul 14 10:22:26 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 14 Jul 2009 10:22:26 -0500
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <20090714101140.CAE36286@m4500-02.uchicago.edu>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost>
<1247541538.30172.8.camel@localhost>
<20090714020526.CAE04557@m4500-02.uchicago.edu>
<1247583541.1437.1.camel@localhost>
<20090714101140.CAE36286@m4500-02.uchicago.edu>
Message-ID: <1247584946.1759.8.camel@localhost>
On Tue, 2009-07-14 at 10:11 -0500, skenny at uchicago.edu wrote:
> >I see. What happens is that redirection hasn't been fixed in
> SGE, but
> >the commenting out of it in the gt2 provider did nothing
> because it was
> >enabled in the coaster provider.
>
> right, i must've misunderstood, i had commented out
> redirection for the gt2 provider so swift would work for
> running w/o coasters, but i thought you were saying cog r2430
> would be also redirecting for coasters as well...but
> apparently you were trying a different change?
No. When I suggested the commenting out, I forgot that the output is
redirected anyway by the coaster code. So our little experiment did
nothing.
Cog r2430 removed the explicit redirection in the coaster code. Without
that and without the hack to always redirect for SGE in the gt2 provider
that you commented out, there was no more redirection, so the SGE job
manager bug surfaced.
In cog r2431, there's redirection to a file. Do keep the lines commented
in the gt2 provider. I'm not sure how that will work out, but please try
and let me know.
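[Editorial note] The two behaviours discussed here (capturing the job's
stdout/stderr in memory vs. redirecting it to a file, as cog r2431 does)
can be sketched roughly as follows; this is an illustration of the two
strategies, not the actual coaster provider code, and `cmd` stands in for
the block task:

```python
import os
import subprocess
import tempfile

cmd = ["echo", "hello"]  # stands in for the submitted block task

# Strategy 1 (pre-r2430, roughly): capture the job's output in memory.
# With the SGE job manager this triggered the "job manager could not
# stage out a file" problem.
in_memory = subprocess.run(cmd, capture_output=True, text=True).stdout

# Strategy 2 (r2431, roughly): redirect stdout/stderr to a file, which
# is what the Ranger SGE job manager appears to require.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as out:
    subprocess.run(cmd, stdout=out, stderr=subprocess.STDOUT)
with open(path) as f:
    from_file = f.read()
os.unlink(path)
```

Both strategies see the same job output; they differ only in where the
job manager is asked to put it.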
From skenny at uchicago.edu Tue Jul 14 12:38:36 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Tue, 14 Jul 2009 12:38:36 -0500 (CDT)
Subject: [Swift-devel] [Bug 210] job exceeding
wallclock limit -- error is not reported by swift
In-Reply-To: <20090714112900.40D302CB0F@wind.mcs.anl.gov>
References:
<20090714112900.40D302CB0F@wind.mcs.anl.gov>
Message-ID: <20090714123836.CAE58724@m4500-02.uchicago.edu>
can you try resubmitting your test to ranger?
---- Original message ----
>Date: Tue, 14 Jul 2009 06:29:00 -0500 (CDT)
>From: bugzilla-daemon at mcs.anl.gov
>Subject: [Swift-devel] [Bug 210] job exceeding wallclock
limit -- error is not reported by swift
>To: swift-devel at ci.uchicago.edu
>
>https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=210
>
>
>
>
>
>--- Comment #1 from Ben Clifford
2009-07-14 06:29:00 ---
>This bug is rather ambiguously described.
>
>In non-bugzilla discussion it has been reported as:
>
>> well, for some reason, when a job hits wallclock and is
killed by the JM, swift just keeps saying "active"
>
>This is not behaviour that I observe with Swift against NCSA
using the below
>swiftscript and configuration using Swift swift-r3006
cog-r2430 - in such case,
>I see the job fail three times in a row and then the example
SwiftScript fails
>as should happen.
>
>Please clarify this bug.
>
>s.swift:
>
>$ cat s.swift
>type messagefile;
>
>app (messagefile t) greeting() {
> sleep "999s" stdout=@filename(t);
>}
>
>messagefile outfile <"hello.txt">;
>
>outfile = greeting();
>
>
>
>tc.data:
>
>$ cat tc.data
>cat: tc.data: No such file or directory
>benc at communicado:~/tmp-walltime/cog/modules/swift !1055
>$ cat dist/swift-svn/etc/tc.data
>#This is the transformation catalog.
>#
>#It comes pre-configured with a number of simple
transformations with
>#paths that are likely to work on a linux box. However, on
some systems,
>#the paths to these executables will be different (for
example, sometimes
>#some of these programs are found in /usr/bin rather than in
/bin)
>#
>#NOTE WELL: fields in this file must be separated by tabs,
not spaces; and
>#there must be no trailing whitespace at the end of each line.
>#
># sitename transformation path INSTALLED platform profiles
>hg echo /bin/echo INSTALLED INTEL32::LINUX
null
>hg cat /bin/cat INSTALLED INTEL32::LINUX
null
>hg ls /bin/ls INSTALLED INTEL32::LINUX
null
>hg grep /bin/grep INSTALLED INTEL32::LINUX
null
>hg sort /bin/sort INSTALLED INTEL32::LINUX
null
>hg sleep /bin/sleep INSTALLED
INTEL32::LINUX null
>
>
>site definition:
>
>
>
> url="grid-hg.ncsa.teragrid.org/jobmanager-pbs
>" major="2" />
> /home/ac/benc
> debug
> 1
>
>
>
>the output:
>
>Swift svn swift-r3006 cog-r2430
>
>RunID: 20090714-0616-dgktv8b3
>Progress:
>Progress: Stage in:1
>Progress: Submitted:1
>Progress: Submitted:1
>Progress: Submitted:1
>Progress: Active:1
>Progress: Active:1
>Progress: Active:1
>Progress: Active:1
>Progress: Checking status:1
>Progress: Stage in:1
>Progress: Submitted:1
>Progress: Submitted:1
>Progress: Active:1
>Progress: Active:1
>Progress: Active:1
>Progress: Checking status:1
>Progress: Submitted:1
>Progress: Submitted:1
>Progress: Submitted:1
>Progress: Active:1
>Progress: Active:1
>Progress: Active:1
>Progress: Checking status:1
>Execution failed:
> Exception in sleep:
>Arguments: [999s]
>Host: hg
>Directory: s-20090714-0616-dgktv8b3/jobs/8/sleep-8h82cndj
>stderr.txt:
>stdout.txt:
>----
>
>Caused by:
> No status file was found. Check the shared filesystem on hg
>
>--
>Configure bugmail:
https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
>------- You are receiving this mail because: -------
>You are watching the assignee of the bug.
>You are watching someone on the CC list of the bug.
>_______________________________________________
>Swift-devel mailing list
>Swift-devel at ci.uchicago.edu
>http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From skenny at uchicago.edu Tue Jul 14 13:22:26 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Tue, 14 Jul 2009 13:22:26 -0500 (CDT)
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <1247584946.1759.8.camel@localhost>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost>
<1247541538.30172.8.camel@localhost>
<20090714020526.CAE04557@m4500-02.uchicago.edu>
<1247583541.1437.1.camel@localhost>
<20090714101140.CAE36286@m4500-02.uchicago.edu>
<1247584946.1759.8.camel@localhost>
Message-ID: <20090714132226.CAE63545@m4500-02.uchicago.edu>
darn...
Execution failed:
Exception in RInvoke:
Arguments: [scripts/4reg_dummy.R,
matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech]
Host: RANGER
Directory:
4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj
stderr.txt:
stdout.txt:
----
Caused by:
Block task failed: 0714-090151-000000Block task ended
prematurely
Cleaning up...
Shutting down service at https://129.114.50.163:38571
i can file a bug report with TG if need be, but i'm not quite
sure the best thing to tell them (?) also, i'm wondering how
coasters was previously able to work around this bug?
~sk
---- Original message ----
>Date: Tue, 14 Jul 2009 10:22:26 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] Coasters and std's on ranger
>To: skenny at uchicago.edu
>Cc: swift-devel
>
>On Tue, 2009-07-14 at 10:11 -0500, skenny at uchicago.edu wrote:
>> >I see. What happens is that redirection hasn't been fixed in
>> SGE, but
>> >the commenting out of it in the gt2 provider did nothing
>> because it was
>> >enabled in the coaster provider.
>>
>> right, i must've misunderstood, i had commented out
>> redirection for the gt2 provider so swift would work for
>> running w/o coasters, but i thought you were saying cog r2430
>> would be also redirecting for coasters as well...but
>> apparently you were trying a different change?
>
>No. As I was mentioning the commenting out I forgot that the
output is
>redirected anyway by the coaster code. So our little
experiment then did
>nothing.
>
>Cog r2430 removed the explicit redirection in the coaster
code. Without
>that and without the hack to always redirect for SGE in the
gt2 provider
>that you commented out, there was no more redirection, so the
SGE job
>manager bug surfaced.
>
>In cog r2431, there's redirection to a file. Do keep the
lines commented
>in the gt2 provider. I'm not sure how that will work out, but
please try
>and let me know.
>
From hategan at mcs.anl.gov Tue Jul 14 14:21:57 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 14 Jul 2009 14:21:57 -0500
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <20090714132226.CAE63545@m4500-02.uchicago.edu>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost>
<1247541538.30172.8.camel@localhost>
<20090714020526.CAE04557@m4500-02.uchicago.edu>
<1247583541.1437.1.camel@localhost>
<20090714101140.CAE36286@m4500-02.uchicago.edu>
<1247584946.1759.8.camel@localhost>
<20090714132226.CAE63545@m4500-02.uchicago.edu>
Message-ID: <1247599317.7032.0.camel@localhost>
On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote:
> darn...
>
> Execution failed:
> Exception in RInvoke:
> Arguments: [scripts/4reg_dummy.R,
> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech]
> Host: RANGER
> Directory:
> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Block task failed: 0714-090151-000000Block task ended
> prematurely
>
> Cleaning up...
> Shutting down service at https://129.114.50.163:38571
>
> i can file a bug report with TG if need be, but i'm not quite
> sure the best thing to tell them (?) also, i'm wondering how
> coasters was previously able to work around this bug?
By redirecting stdout+stderr to memory, but that causes the "job manager
could not stage out a file" problem.
From wilde at mcs.anl.gov Tue Jul 14 14:29:00 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 14 Jul 2009 14:29:00 -0500
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <1247599317.7032.0.camel@localhost>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> <1247583541.1437.1.camel@localhost> <20090714101140.CAE36286@m4500-02.uchicago.edu> <1247584946.1759.8.camel@localhost> <20090714132226.CAE63545@m4500-02.uchicago.edu>
<1247599317.7032.0.camel@localhost>
Message-ID: <4A5CDC7C.6070304@mcs.anl.gov>
Will the current code work for Swift programs that don't use stdout or
stderr? (I.e., where the app wrappers redirect these to a file?)
- Mike
On 7/14/09 2:21 PM, Mihael Hategan wrote:
> On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote:
>> darn...
>>
>> Execution failed:
>> Exception in RInvoke:
>> Arguments: [scripts/4reg_dummy.R,
>> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech]
>> Host: RANGER
>> Directory:
>> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>> Block task failed: 0714-090151-000000Block task ended
>> prematurely
>>
>> Cleaning up...
>> Shutting down service at https://129.114.50.163:38571
>>
>> i can file a bug report with TG if need be, but i'm not quite
>> sure the best thing to tell them (?) also, i'm wondering how
>> coasters was previously able to work around this bug?
>
> By redirecting stdout+stderr to memory, but that causes the "job manager
> could not stage out a file" problem.
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Tue Jul 14 14:33:33 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 14 Jul 2009 14:33:33 -0500
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <4A5CDC7C.6070304@mcs.anl.gov>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost>
<20090714020526.CAE04557@m4500-02.uchicago.edu>
<1247583541.1437.1.camel@localhost>
<20090714101140.CAE36286@m4500-02.uchicago.edu>
<1247584946.1759.8.camel@localhost>
<20090714132226.CAE63545@m4500-02.uchicago.edu>
<1247599317.7032.0.camel@localhost> <4A5CDC7C.6070304@mcs.anl.gov>
Message-ID: <1247600013.7032.13.camel@localhost>
This isn't the app stdout/stderr, but the job stdout/stderr. They are
redirected in coasters for debugging/accounting purposes, and with SGE
because the [censored] thing doesn't work otherwise.
On Tue, 2009-07-14 at 14:29 -0500, Michael Wilde wrote:
> Will the current code work for swift programs that dont use stdout or
> stderr? (Ie where the app wrappers redirect these to a file?)
>
> - Mike
>
> On 7/14/09 2:21 PM, Mihael Hategan wrote:
> > On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote:
> >> darn...
> >>
> >> Execution failed:
> >> Exception in RInvoke:
> >> Arguments: [scripts/4reg_dummy.R,
> >> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech]
> >> Host: RANGER
> >> Directory:
> >> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj
> >> stderr.txt:
> >>
> >> stdout.txt:
> >>
> >> ----
> >>
> >> Caused by:
> >> Block task failed: 0714-090151-000000Block task ended
> >> prematurely
> >>
> >> Cleaning up...
> >> Shutting down service at https://129.114.50.163:38571
> >>
> >> i can file a bug report with TG if need be, but i'm not quite
> >> sure the best thing to tell them (?) also, i'm wondering how
> >> coasters was previously able to work around this bug?
> >
> > By redirecting stdout+stderr to memory, but that causes the "job manager
> > could not stage out a file" problem.
> >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Tue Jul 14 14:46:17 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 14 Jul 2009 14:46:17 -0500
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <1247600013.7032.13.camel@localhost>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost>
<20090714020526.CAE04557@m4500-02.uchicago.edu>
<1247583541.1437.1.camel@localhost>
<20090714101140.CAE36286@m4500-02.uchicago.edu>
<1247584946.1759.8.camel@localhost>
<20090714132226.CAE63545@m4500-02.uchicago.edu>
<1247599317.7032.0.camel@localhost> <4A5CDC7C.6070304@mcs.anl.gov>
<1247600013.7032.13.camel@localhost>
Message-ID: <4A5CE089.40603@mcs.anl.gov>
So are any of the following reasonable ways to proceed?
1) Develop an SGE provider (hopefully heavily based on the PBS provider)
and run on Ranger locally.
2) Debug getting Coasters, GRAM and SGE to coexist nicely (ie the
debugging route in progress now)
3) Start the coaster service manually in one block allocation and have
it rendezvous with Swift
For (2) can we create a GRAM test job outside of Swift that we can
debug, to try to find a set of GRAM options that work? I need to read
the thread more carefully, but I dont understand if the problem is in
Ranger SGE, the GRAM SGE jobmanager, or the interaction between them.
I'll re-read the thread first before asking for more clarification; I
didnt get it on first read.
- Mike
On 7/14/09 2:33 PM, Mihael Hategan wrote:
> This isn't the app stdout/stderr, but the job stdout/stderr. They are
> redirected in coasters for debugging/accounting purposes, and with SGE
> because the [censored] thing doesn't work otherwise.
>
> On Tue, 2009-07-14 at 14:29 -0500, Michael Wilde wrote:
>> Will the current code work for swift programs that dont use stdout or
>> stderr? (Ie where the app wrappers redirect these to a file?)
>>
>> - Mike
>>
>> On 7/14/09 2:21 PM, Mihael Hategan wrote:
>>> On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote:
>>>> darn...
>>>>
>>>> Execution failed:
>>>> Exception in RInvoke:
>>>> Arguments: [scripts/4reg_dummy.R,
>>>> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech]
>>>> Host: RANGER
>>>> Directory:
>>>> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj
>>>> stderr.txt:
>>>>
>>>> stdout.txt:
>>>>
>>>> ----
>>>>
>>>> Caused by:
>>>> Block task failed: 0714-090151-000000Block task ended
>>>> prematurely
>>>>
>>>> Cleaning up...
>>>> Shutting down service at https://129.114.50.163:38571
>>>>
>>>> i can file a bug report with TG if need be, but i'm not quite
>>>> sure the best thing to tell them (?) also, i'm wondering how
>>>> coasters was previously able to work around this bug?
>>> By redirecting stdout+stderr to memory, but that causes the "job manager
>>> could not stage out a file" problem.
>>>
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From hategan at mcs.anl.gov Tue Jul 14 14:53:05 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 14 Jul 2009 14:53:05 -0500
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <4A5CE089.40603@mcs.anl.gov>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost>
<20090714020526.CAE04557@m4500-02.uchicago.edu>
<1247583541.1437.1.camel@localhost>
<20090714101140.CAE36286@m4500-02.uchicago.edu>
<1247584946.1759.8.camel@localhost>
<20090714132226.CAE63545@m4500-02.uchicago.edu>
<1247599317.7032.0.camel@localhost> <4A5CDC7C.6070304@mcs.anl.gov>
<1247600013.7032.13.camel@localhost> <4A5CE089.40603@mcs.anl.gov>
Message-ID: <1247601185.7638.3.camel@localhost>
On Tue, 2009-07-14 at 14:46 -0500, Michael Wilde wrote:
> So are any of the following reasonable ways to proceed?
>
> 1) Develop an SGE provider (hopefully heavily based on the PBS provider)
> and run on Ranger locally.
>
> 2) Debug getting Coasters, GRAM and SGE to coexist nicely (ie the
> debugging route in progress now)
Yeah. I mentioned those two yesterday.
>
> 3) Start the coaster service manually in one block allocation and have
> it rendezvous with Swift
Possible. You could also force the current one to allocate a single
block or even ignore the stageout error because it occurs after a block
is done.
>
> For (2) can we create a GRAM test job outside of Swift that we can
> debug, to try to find a set of GRAM options that work?
You're welcome to try. I haven't been able to do it so far.
From skenny at uchicago.edu Tue Jul 14 16:57:40 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Tue, 14 Jul 2009 16:57:40 -0500 (CDT)
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <1247599317.7032.0.camel@localhost>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost>
<1247541538.30172.8.camel@localhost>
<20090714020526.CAE04557@m4500-02.uchicago.edu>
<1247583541.1437.1.camel@localhost>
<20090714101140.CAE36286@m4500-02.uchicago.edu>
<1247584946.1759.8.camel@localhost>
<20090714132226.CAE63545@m4500-02.uchicago.edu>
<1247599317.7032.0.camel@localhost>
Message-ID: <20090714165740.CAE89638@m4500-02.uchicago.edu>
---- Original message ----
>Date: Tue, 14 Jul 2009 14:21:57 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] Coasters and std's on ranger
>To: skenny at uchicago.edu
>Cc: swift-devel
>
>On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote:
>> darn...
>>
>> Execution failed:
>> Exception in RInvoke:
>> Arguments: [scripts/4reg_dummy.R,
>> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech]
>> Host: RANGER
>> Directory:
>> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>> Block task failed: 0714-090151-000000Block task ended
>> prematurely
>>
>> Cleaning up...
>> Shutting down service at https://129.114.50.163:38571
>>
>> i can file a bug report with TG if need be, but i'm not quite
>> sure the best thing to tell them (?) also, i'm wondering how
>> coasters was previously able to work around this bug?
>
>By redirecting stdout+stderr to memory, but that causes the
"job manager
>could not stage out a file" problem.
actually, i meant (way back when this worked for me :) prior
to any of the redirection (circa swift stable release
0.8ish)...but perhaps that's assuming the sge bug existed then
as well...
From hategan at mcs.anl.gov Tue Jul 14 17:16:38 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 14 Jul 2009 17:16:38 -0500
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <20090714165740.CAE89638@m4500-02.uchicago.edu>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost>
<1247541538.30172.8.camel@localhost>
<20090714020526.CAE04557@m4500-02.uchicago.edu>
<1247583541.1437.1.camel@localhost>
<20090714101140.CAE36286@m4500-02.uchicago.edu>
<1247584946.1759.8.camel@localhost>
<20090714132226.CAE63545@m4500-02.uchicago.edu>
<1247599317.7032.0.camel@localhost>
<20090714165740.CAE89638@m4500-02.uchicago.edu>
Message-ID: <1247609798.10133.1.camel@localhost>
On Tue, 2009-07-14 at 16:57 -0500, skenny at uchicago.edu wrote:
> ---- Original message ----
> >Date: Tue, 14 Jul 2009 14:21:57 -0500
> >From: Mihael Hategan
> >Subject: Re: [Swift-devel] Coasters and std's on ranger
> >To: skenny at uchicago.edu
> >Cc: swift-devel
> >
> >On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote:
> >> darn...
> >>
> >> Execution failed:
> >> Exception in RInvoke:
> >> Arguments: [scripts/4reg_dummy.R,
> >> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech]
> >> Host: RANGER
> >> Directory:
> >> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj
> >> stderr.txt:
> >>
> >> stdout.txt:
> >>
> >> ----
> >>
> >> Caused by:
> >> Block task failed: 0714-090151-000000Block task ended
> >> prematurely
> >>
> >> Cleaning up...
> >> Shutting down service at https://129.114.50.163:38571
> >>
> >> i can file a bug report with TG if need be, but i'm not quite
> >> sure the best thing to tell them (?) also, i'm wondering how
> >> coasters was previously able to work around this bug?
> >
> >By redirecting stdout+stderr to memory, but that causes the
> "job manager
> >could not stage out a file" problem.
>
> actually, i meant (way back when this worked for me :) prior
> to any of the redirection (circa swift stable release
> 0.8ish)...but perhaps that's assuming the sge bug existed then
> as well...
Yes, it did. But due to the way the coasters worked at the time, the
error was ignored. I can make it such that this is the case again.
From skenny at uchicago.edu Tue Jul 14 17:22:51 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Tue, 14 Jul 2009 17:22:51 -0500 (CDT)
Subject: [Swift-devel] Coasters and std's on ranger
In-Reply-To: <1247609798.10133.1.camel@localhost>
References: <20090713215411.CAD90960@m4500-02.uchicago.edu>
<1247540751.30172.5.camel@localhost>
<1247541538.30172.8.camel@localhost>
<20090714020526.CAE04557@m4500-02.uchicago.edu>
<1247583541.1437.1.camel@localhost>
<20090714101140.CAE36286@m4500-02.uchicago.edu>
<1247584946.1759.8.camel@localhost>
<20090714132226.CAE63545@m4500-02.uchicago.edu>
<1247599317.7032.0.camel@localhost>
<20090714165740.CAE89638@m4500-02.uchicago.edu>
<1247609798.10133.1.camel@localhost>
Message-ID: <20090714172251.CAE92317@m4500-02.uchicago.edu>
---- Original message ----
>Date: Tue, 14 Jul 2009 17:16:38 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] Coasters and std's on ranger
>To: skenny at uchicago.edu
>Cc: swift-devel
>
>On Tue, 2009-07-14 at 16:57 -0500, skenny at uchicago.edu wrote:
>> ---- Original message ----
>> >Date: Tue, 14 Jul 2009 14:21:57 -0500
>> >From: Mihael Hategan
>> >Subject: Re: [Swift-devel] Coasters and std's on ranger
>> >To: skenny at uchicago.edu
>> >Cc: swift-devel
>> >
>> >On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote:
>> >> darn...
>> >>
>> >> Execution failed:
>> >> Exception in RInvoke:
>> >> Arguments: [scripts/4reg_dummy.R,
>> >> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech]
>> >> Host: RANGER
>> >> Directory:
>> >> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj
>> >> stderr.txt:
>> >>
>> >> stdout.txt:
>> >>
>> >> ----
>> >>
>> >> Caused by:
>> >> Block task failed: 0714-090151-000000Block task
ended
>> >> prematurely
>> >>
>> >> Cleaning up...
>> >> Shutting down service at https://129.114.50.163:38571
>> >>
>> >> i can file a bug report with TG if need be, but i'm not
quite
>> >> sure the best thing to tell them (?) also, i'm wondering how
>> >> coasters was previously able to work around this bug?
>> >
>> >By redirecting stdout+stderr to memory, but that causes the
>> "job manager
>> >could not stage out a file" problem.
>>
>> actually, i meant (way back when this worked for me :) prior
>> to any of the redirection (circa swift stable release
>> 0.8ish)...but perhaps that's assuming the sge bug existed then
>> as well...
>
>Yes, it did. But due to the way the coasters worked at the
time, the
>error was ignored. I can make it such that this is the case
again.
>
sounds good to me.
From benc at hawaga.org.uk Wed Jul 15 03:36:23 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 15 Jul 2009 08:36:23 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-commit] r3008 - SwiftApps/SEE/trunk
In-Reply-To: <20090715020426.227839CCC4@vm-125-59.ci.uchicago.edu>
References: <20090715020426.227839CCC4@vm-125-59.ci.uchicago.edu>
Message-ID:
You should file a bug describing this.
On Tue, 14 Jul 2009, noreply at vm-125-59.ci.uchicago.edu wrote:
> Author: aespinosa
> Date: 2009-07-14 21:04:25 -0500 (Tue, 14 Jul 2009)
> New Revision: 3008
>
> Modified:
> SwiftApps/SEE/trunk/instance_mapper.sh
> Log:
> swift code now compiles with struct hack from ben
>
> Modified: SwiftApps/SEE/trunk/instance_mapper.sh
> ===================================================================
> --- SwiftApps/SEE/trunk/instance_mapper.sh 2009-07-14 22:06:08 UTC (rev 3007)
> +++ SwiftApps/SEE/trunk/instance_mapper.sh 2009-07-15 02:04:25 UTC (rev 3008)
> @@ -15,6 +15,7 @@
>
> echo "ofile result/$instance/stdout";
>
> +echo "out null";
> echo "out.expend_out result/$instance/expend.out";
> echo "out.price_out result/$instance/price.out";
> echo "out.ratio_out result/$instance/ratio.out";
>
> _______________________________________________
> Swift-commit mailing list
> Swift-commit at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-commit
>
>
From bugzilla-daemon at mcs.anl.gov Wed Jul 15 11:45:33 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 15 Jul 2009 11:45:33 -0500 (CDT)
Subject: [Swift-devel] [Bug 217] New: struct of structs via ext mapper
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=217
Summary: struct of structs via ext mapper
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: aespinosa at cs.uchicago.edu
Swift is looking for a file describing the 2nd level struct itself.
my swift session
(latest on cog svn and swift svn) reports as follows:
RunID: testing
Progress:
Progress: Initializing site shared directory:1 Failed:1
Execution failed:
Mapper failed to map org.griphyn.vdl.mapping.DataNode
identifier
tag:benc at ci.uchicago.edu,2008:swift:dataset:20090714-1343-6lzjg014:720000000039
type AmplFilter with no value at dataset=res path=.out (not closed)
my instance_mapper.sh:
#!/bin/bash
while getopts ":i:" options; do
case $options in
i) export instance=$OPTARG ;;
*) exit 1;;
esac
done
echo "expend result/$instance/expend.dat";
echo "limits result/$instance/limits.dat";
echo "price result/$instance/price.dat";
echo "ratio result/$instance/ratio.dat";
echo "solve result/$instance/solve.dat";
echo "ofile result/$instance/stdout";
echo "out.expend_out result/$instance/expend.out";
echo "out.price_out result/$instance/price.out";
echo "out.ratio_out result/$instance/ratio.out";
here is the workflow i was working on:
type Template;
type AmplIn;
type StdOut;
type AmplCmd {
Template temp;
AmplIn mod;
AmplIn process;
AmplIn output;
AmplIn so;
AmplIn tree;
}
type ExpendDat;
type LimitsDat;
type PriceDat;
type RatioDat;
type SolveDat;
type ExpendOut;
type PriceOut;
type RatioOut;
type AmplFilter {
ExpendOut expend_out;
PriceOut price_out;
RatioOut ratio_out;
}
type AmplResult {
ExpendDat expend;
LimitsDat limits;
PriceDat price;
RatioDat ratio;
SolveDat solve;
StdOut ofile;
AmplFilter out;
}
app (AmplResult result) run_ampl (string instanceID, AmplCmd cmd)
{
run_ampl instanceID @filename(cmd.temp)
@filename(cmd.mod) @filename(cmd.process)
@filename(cmd.output) @filename(cmd.so) @filename(cmd.tree)
stdout=@filename(result.ofile);
}
AmplCmd const_cmd ;
int runs[]=[2001:2002];
foreach i in runs {
string instanceID = @strcat("run", i);
AmplResult res ;
res = run_ampl(instanceID, const_cmd);
}
Initial hack to get the script to work:
> Author: aespinosa
> Date: 2009-07-14 21:04:25 -0500 (Tue, 14 Jul 2009)
> New Revision: 3008
>
> Modified:
> SwiftApps/SEE/trunk/instance_mapper.sh
> Log:
> swift code now compiles with struct hack from ben
>
> Modified: SwiftApps/SEE/trunk/instance_mapper.sh
> ===================================================================
> --- SwiftApps/SEE/trunk/instance_mapper.sh 2009-07-14 22:06:08 UTC (rev 3007)
> +++ SwiftApps/SEE/trunk/instance_mapper.sh 2009-07-15 02:04:25 UTC (rev 3008)
> @@ -15,6 +15,7 @@
>
> echo "ofile result/$instance/stdout";
>
> +echo "out null";
> echo "out.expend_out result/$instance/expend.out";
> echo "out.price_out result/$instance/price.out";
> echo "out.ratio_out result/$instance/ratio.out";
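Folding the r3008 hack into the mapper, the whole script would look roughly like
the sketch below. An ext mapper simply prints one "field.path filename" pair per
line on stdout; the "out null" line is the workaround that maps the nested
struct node itself to a placeholder. This is illustrative (restructured as a
function), not the committed SEE script:

```shell
#!/bin/bash
# Sketch of the SEE external mapper with the r3008 workaround folded in.
# An ext mapper prints one "field.path filename" pair per line on stdout.
instance_map() {
  local instance=$1
  echo "expend result/$instance/expend.dat"
  echo "limits result/$instance/limits.dat"
  echo "price result/$instance/price.dat"
  echo "ratio result/$instance/ratio.dat"
  echo "solve result/$instance/solve.dat"
  echo "ofile result/$instance/stdout"
  # Workaround: map the nested struct node itself to a placeholder so
  # Swift does not go looking for a file describing the struct.
  echo "out null"
  echo "out.expend_out result/$instance/expend.out"
  echo "out.price_out result/$instance/price.out"
  echo "out.ratio_out result/$instance/ratio.out"
}

instance_map "${1:-run2001}"
```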
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
From iraicu at cs.uchicago.edu Wed Jul 15 14:04:49 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 15 Jul 2009 14:04:49 -0500
Subject: [Swift-devel] CFP: 2nd ACM Workshop on Many-Task Computing on Grids
and Supercomputers (MTAGS09) at Supercomputing 2009
Message-ID: <4A5E2851.1000500@cs.uchicago.edu>
Call for Papers
---------------------------------------------------------------------------------------
The 2nd ACM Workshop on Many-Task Computing on Grids and Supercomputers
(MTAGS) 2009
http://dsl.cs.uchicago.edu/MTAGS09/
---------------------------------------------------------------------------------------
November 16th, 2009
Portland, Oregon, USA
Co-located with the IEEE/ACM International Conference for
High Performance Computing, Networking, Storage and Analysis (SC09)
=======================================================================================
The 2nd workshop on Many-Task Computing on Grids and Supercomputers
(MTAGS) will
provide the scientific community a dedicated forum for presenting new
research,
development, and deployment efforts of loosely coupled large scale
applications on
large scale clusters, Grids, Supercomputers, and Cloud Computing
infrastructure.
Many-task computing (MTC), the theme of the workshop, encompasses loosely coupled
applications, which are generally composed of many tasks (both
independent and
dependent tasks) to achieve some larger application goal. This workshop
will cover
challenges that can hamper efficiency and utilization in running
applications on
large-scale systems, such as local resource manager scalability and
granularity,
efficient utilization of the raw hardware, parallel file system
contention and
scalability, reliability at scale, and application scalability. We
welcome paper
submissions on all topics related to MTC on large scale systems. Papers
will be
peer-reviewed, and accepted papers will be published in the workshop
proceedings as
part of the ACM digital library. The workshop will be co-located with the
IEEE/ACM Supercomputing 2009 Conference in Portland, Oregon, on November
16th, 2009.
For more information, please visit http://dsl.cs.uchicago.edu/MTAGS09/.
Scope
---------------------------------------------------------------------------------------
This workshop will focus on the ability to manage and execute large
scale applications
on today's largest clusters, Grids, and Supercomputers. Clusters with 50K+
processor cores (e.g. the TACC Sun Constellation System, Ranger), Grids with
a dozen sites and 100K+ processors (e.g. TeraGrid), and supercomputers with
160K processors (e.g. IBM BlueGene/P) are beginning to come online. Large
clusters and supercomputers have
traditionally
been high performance computing (HPC) systems, as they are efficient at
executing
tightly coupled parallel jobs within a particular machine with low-latency
interconnects; the applications typically use message passing interface
(MPI) to
achieve the needed inter-process communication. On the other hand, Grids
have been the
preferred platform for more loosely coupled applications that tend to be
managed and
executed through workflow systems. In contrast to HPC (tightly coupled
applications),
these loosely coupled applications make up a new class of applications
as what we call
Many-Task Computing (MTC). MTC systems generally involve the execution
of independent,
sequential jobs that can be individually scheduled on many different
computing
resources across multiple administrative boundaries. MTC systems
typically achieve this
using various grid computing technologies and techniques, and often use files
for inter-process communication as an alternative to MPI. MTC is reminiscent
of High Throughput Computing (HTC); however, MTC differs from HTC in its
emphasis on using many computing resources over short periods of time to
accomplish many computational tasks, where the primary metrics are measured
in seconds (e.g. FLOPS, tasks/sec, MB/s I/O rates). HTC, on the other hand,
requires large amounts of computing for longer times (months and years rather
than hours and days) and is generally measured in operations per month.
Today's existing HPC systems are a viable platform to host MTC
applications. However,
some challenges arise in large scale applications when run on large
scale systems,
which can hamper the efficiency and utilization of these large scale
systems. These
challenges range from local resource manager scalability and granularity,
efficient utilization of the raw hardware, shared file system contention and
scalability, reliability at scale, and application scalability, to
understanding the limitations of HPC systems in order to identify good
candidate MTC applications.
Furthermore, the MTC
paradigm can be naturally applied to the emerging Cloud Computing
paradigm due to its
loosely coupled nature, which is being adopted by industry as the next
wave of
technological advancement to reduce operational costs while improving
efficiencies in
large scale infrastructures.
For an interesting discussion in a blog by Ian Foster on the difference
between MTC and
HTC, please see his blog
at http://ianfoster.typepad.com/blog/2008/07/many-tasks-comp.html.
We also published two papers that are highly relevant to this workshop.
One paper is
titled "Toward Loosely Coupled Programming on Petascale Systems", and
was published in
SC08; the second paper is titled "Many-Task Computing for Grids and
Supercomputers",
which was published in MTAGS08. Furthermore, to see last year's workshop
program agenda,
and accepted papers and presentations, please
see http://dsl.cs.uchicago.edu/MTAGS08/.
For more information, please visit http://dsl.cs.uchicago.edu/MTAGS09/.
Topics
---------------------------------------------------------------------------------------
MTAGS 2009 topics of interest include, but are not limited to:
* Compute Resource Management in large scale clusters, large Grids,
Supercomputers,
or Cloud Computing infrastructure
o Scheduling
o Job execution frameworks
o Local resource manager extensions
o Performance evaluation of resource managers in use on large
scale systems
o Challenges and opportunities in running many-task workloads on
HPC systems
o Challenges and opportunities in running many-task workloads
on Cloud Computing infrastructure
* Data Management in large scale Grid and Supercomputer environments:
o Data-Aware Scheduling
o Parallel File System performance and scalability in large
deployments
o Distributed file systems
o Data caching frameworks and techniques
* Large-Scale Workflow Systems
o Workflow system performance and scalability analysis
o Scalability of workflow systems
o Workflow infrastructure and e-Science middleware
o Programming Paradigms and Models
* Large-Scale Many-Task Applications
o Large-scale many-task applications
o Large-scale many-task data-intensive applications
o Large-scale high throughput computing (HTC) applications
o Quasi-supercomputing applications, deployments, and experiences
Paper Submission and Publication
---------------------------------------------------------------------------------------
Authors are invited to submit papers with unpublished, original work of
not more than
10 pages of double column text using single spaced 10 point size on 8.5
x 11 inch pages,
as per ACM 8.5 x 11 manuscript guidelines
(http://www.acm.org/publications/instructions_for_proceedings_volumes);
document
templates can be found
at http://www.acm.org/sigs/publications/proceedings-templates.
A 250 word abstract (PDF format) must be submitted online at
https://cmt.research.microsoft.com/MTAGS2009/ before the deadline of
August 1st, 2009
at 11:59PM PST; the final 10 page papers in PDF format will be due on
September 1st,
2009 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers
will be
published in the workshop proceedings as part of the ACM digital
library. Notifications
of the paper decisions will be sent out by October 1st, 2009. Selected
excellent work
will be invited to submit extended versions of the workshop paper to the
IEEE Transactions on Parallel and Distributed Systems (TPDS) Journal,
Special Issue on
Many-Task Computing (due December 21st, 2009); for more information
about this journal
special issue, please visit http://dsl.cs.uchicago.edu/TPDS_MTC/.
Submission implies
the willingness of at least one of the authors to register and present
the paper. For
more information, please visit http://dsl.cs.uchicago.edu/MTAGS09/.
Important Dates
---------------------------------------------------------------------------------------
* Abstract Due: August 1st, 2009
* Papers Due: September 1st, 2009
* Notification of Acceptance: October 1st, 2009
* Camera Ready Papers Due: November 1st, 2009
* Workshop Date: November 16th, 2009
Committee Members
---------------------------------------------------------------------------------------
Workshop Chairs
* Ioan Raicu, University of Chicago
* Ian Foster, University of Chicago & Argonne National Laboratory
* Yong Zhao, Microsoft
Technical Committee (confirmed)
* David Abramson, Monash University, Australia
* Pete Beckman, Argonne National Laboratory, USA
* Peter Dinda, Northwestern University, USA
* Ian Foster, University of Chicago & Argonne National Laboratory, USA
* Bob Grossman, University of Illinois at Chicago, USA
* Indranil Gupta, University of Illinois at Urbana Champaign, USA
* Alexandru Iosup, Delft University of Technology, Netherlands
* Kamil Iskra, Argonne National Laboratory, USA
* Chuang Liu, Ask.com, USA
* Zhou Lei, Shanghai University, China
* Shiyong Lu, Wayne State University, USA
* Reagan Moore, University of North Carolina at Chapel Hill, USA
* Marlon Pierce, Indiana University, USA
* Ioan Raicu, University of Chicago, USA
* Matei Ripeanu, University of British Columbia, Canada
* David Swanson, University of Nebraska, USA
* Greg Thain, University of Wisconsin, USA
* Matthew Woitaszek, The University Corporation for Atmospheric
Research, USA
* Mike Wilde, University of Chicago & Argonne National Laboratory, USA
* Sherali Zeadally, University of the District of Columbia, USA
* Yong Zhao, Microsoft, USA
From tanu00 at gmail.com Fri Jul 17 11:01:11 2009
From: tanu00 at gmail.com (Tanu Malik)
Date: Fri, 17 Jul 2009 12:01:11 -0400
Subject: [Swift-devel] Provenance DB for Swift
Message-ID: <66d19ae50907170901n533e1f4dl686f22b1c747cf7e@mail.gmail.com>
Hi Ben, Mike
I was wondering if there is open access to the Provenance DB for Swift ?
We have built a provenance query and database that performs distributed
provenance querying. Our examples are currently all artificial and I was
wondering if we can test the same with Provenance DB for Swift.
We have a deadline in Sept. and an early reply from you will be very helpful.
Thanks,
Tanu
From hategan at mcs.anl.gov Fri Jul 17 15:35:21 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 17 Jul 2009 15:35:21 -0500
Subject: [Swift-devel] Coaster CPU-time consumption issue
In-Reply-To: <1247524902.25358.3.camel@localhost>
References: <4A5B64C7.4080802@mcs.anl.gov>
<1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov>
<1247509395.20144.4.camel@localhost>
<50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com>
<1247511969.21171.4.camel@localhost>
<50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com>
<4A5BA007.2050101@mcs.anl.gov>
<50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com>
<1247524902.25358.3.camel@localhost>
Message-ID: <1247862921.9627.1.camel@localhost>
On that same topic, cog r2438 removes another spin that would get
triggered in certain circumstances (after a bunch of jobs are done).
On Mon, 2009-07-13 at 17:41 -0500, Mihael Hategan wrote:
> A slightly modified version of this is in cog r2429.
>
> Thanks again,
>
> Mihael
>
> On Mon, 2009-07-13 at 17:04 -0500, Allan Espinosa wrote:
> > hi,
> >
> > here is a patch which solves the cpu usage on the bootstrap coaster
> > service: http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch
> >
> > suggested svn log entry:
> > Added locks via wait() and notify() to prevent busy waiting/
> > active polling in the block task queue.
> >
> >
> > Test 2000 touch job using 066-many.swift via local:local :
> > before: http://www.ci.uchicago.edu/~aespinosa/swift/run06
> > after: http://www.ci.uchicago.edu/~aespinosa/swift/run07
> >
> > CPU usage drops from 100% to 0% with a few 25-40 % spikes!
> >
> > -Allan
> >
> >
> > 2009/7/13 Michael Wilde :
> > > Hi Allan,
> > >
> > > I think the methods you want for synchronization are part of class Object.
> > >
> > > They are documented in the chapter Threads and Locks of The Java Language
> > > Specification:
> > >
> > > http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8
> > >
> > > queue.wait() should be called if the queue is empty.
> > >
> > > queue.notify() or .notifyall() should be called when something is added to
> > > the queue. I think notify() should work.
> > >
> > > .wait will I think take a timer, but suspect you dont need that.
> > >
> > > Both should be called within the synchronized(queue) constructs that are
> > > already in the code.
> > >
> > > Should be fun to fix this!
> > >
> > > - Mike
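The guarded wait/notify pattern Mike describes can be sketched as below. This
is a minimal illustrative sketch, not the actual CoG provider-coaster code;
the class and method names here are invented. The key points are that wait()
is called in a loop (guarding against spurious wakeups) while holding the
queue's monitor, and notify() is called after every add, so consumers block
instead of busy-polling:

```java
import java.util.LinkedList;
import java.util.Queue;

// Minimal sketch of a task queue whose consumers block instead of
// busy-waiting. Names are illustrative, not the actual coaster code.
public class BlockTaskQueue {
    private final Queue<Runnable> queue = new LinkedList<Runnable>();

    public void add(Runnable task) {
        synchronized (queue) {
            queue.add(task);
            // Wake one consumer blocked in take(); a single notify()
            // suffices because only one item was added.
            queue.notify();
        }
    }

    public Runnable take() throws InterruptedException {
        synchronized (queue) {
            // Always wait in a loop: wait() can return spuriously, and
            // another consumer may have drained the queue first.
            while (queue.isEmpty()) {
                queue.wait();
            }
            return queue.poll();
        }
    }
}
```

Allan's patch applies the same idea to the coaster block task queue, which is
what the thread reports dropping the service's CPU usage from ~100% to near 0%.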
> > >
> > >
> > >
> > >
> > >
> > > On 7/13/09 2:12 PM, Allan Espinosa wrote:
> > >>
> > >> 97% is an average as can be seen in run06. swift version is r3005 and
> > >> cogkit r2410. this is a vanilla build of swift.
> > >>
> > >> 2009/7/13 Mihael Hategan :
> > >>>
> > >>> A while ago I committed a patch to run the service process with a lower
> > >>> priority. Is that in use?
> > >>>
> > >>> Also, is logging reduced or is it the default?
> > >>>
> > >>> Is the 97% CPU usage a spike, or does it stay there on average?
> > >>>
> > >>> Can I take a look at the coaster logs from skenny's run on ranger?
> > >>>
> > >>> I'd also like to point out in as little offensive mode as I can, that
> > >>> I'm working 100% on I2U2 and my lack of getting more than lightly
> > >>> involved in this is a consequence of that.
> > >>>
> > >>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
> > >>>>
> > >>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From
> > >>>> here process 22395 is the child of the main java process
> > >>>> (bootstrap.jar) and is loading the CPU.
> > >>>>
> > >>>> I have coasters.log, worker-*log, swift logs, gram logs in
> > >>>> ~aespinosa/workflows/activelog/run06. This refers to a different run.
> > >>>> PID 15206 is the child java process of bootstrap.jar in here.
> > >>>>
> > >>>> top snapshot:
> > >>>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80,
> > >>>> 0.55
> > >>>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
> > >>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si,
> > >>>> 0.0%st
> > >>>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers
> > >>>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached
> > >>>>
> > >>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > >>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java
> > >>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top
> > >>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10
> > >>>> globus-job-mana
> > >>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd
> > >>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash
> > >>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash
> > >>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash
> > >>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java
> > >>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09
> > >>>> globus-job-man
> > >>>>
> > >>>> ps snapshot:
> > >>>>
> > >>>> 22328 ? S 0:00 \_ /bin/bash
> > >>>> 22364 ? Sl 0:00 \_
> > >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
> > >>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
> > >>>>
> > >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
> > >>>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
> > >>>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
> > >>>> https://128.135.125.17:46519 11505253269
> > >>>> 22395 ? SNl 6:29 \_
> > >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
> > >>>>
> > >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
> > >>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
> > >>>> -Djava.security.egd=file:///dev/urandom -cp
> > >>>>
> > >>>> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196f
cdec9
> > > 46b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b860632568434270
1c.jar
> > > :/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendez
vous_s
> > > ervice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc
> > >>>>
> > >>>>
> > >>>>
> > >>>> 2009/7/13 Mihael Hategan :
> > >>>>>
> > >>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
> > >>>>>>>>
> > >>>>>>>> At the time we did not have a chance to gather detailed evidence,
> > >>>>>>>> but I
> > >>>>>>>> was surprised by two things:
> > >>>>>>>>
> > >>>>>>>> - that there were two Java processes and that one was so big. (Most
> > >>>>>>>> likely the active process was just a child thread of the main
> > >>>>>>>> process?)
> > >>>>>>>
> > >>>>>>> One java process is the bootstrap process (it downloads the coaster
> > >>>>>>> jars, sets up the environment and runs the coaster service). It has
> > >>>>>>> always been like this. Did you happen to capture the output of ps to
> > >>>>>>> a
> > >>>>>>> file? That would be useful, because from what you are suggesting, it
> > >>>>>>> appears that the bootstrap process is eating 100% CPU. That process
> > >>>>>>> should only be sleeping after the service is started.
> > >>>>>>
> > >>>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I
> > >>>>>> can't
> > >>>>>> locate it.
> > >>>>>>
> > >>>>>> As best as I can recall it showed the larger memory-footprint process
> > >>>>>> to
> > >>>>>> be relatively idle, and the smaller footprint process (about 275MB) to
> > >>>>>> be burning 100% of a CPU.
> > >>>>>
> > >>>>> Normally, the smaller footprint process should be the bootstrap. But
> > >>>>> that's why I would like the ps output, because it sounds odd.
> > >>>>>
> > >>>>>> Allan will try to get a snapshot of this shortly.
> > >>>>>>
> > >>>>>> If this observation is correct, what's the best way to find out where
> > >>>>>> it's
> > >>>>>> spinning? Profiling? Debug logging? Can you get profiling data from a
> > >>>>>> JVM that doesn't exit?
> > >>>>>
> > >>>>> Once I know where it is, I can look at the code and then we'll go from
> > >>>>> there.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >
> > >
> >
> >
> >
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From aespinosa at cs.uchicago.edu Mon Jul 20 17:11:04 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 20 Jul 2009 17:11:04 -0500
Subject: [Swift-devel] coasters submit jobs with "count=0" in its globus RSL
params
Message-ID: <50b07b4b0907201511p758167f2v3fe24a5dca1da099@mail.gmail.com>
session message:
Caused by:
Block task failed: Error submitting block task
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Cannot submit job
at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146)
at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100)
at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50)
at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:66)
Caused by: org.globus.gram.GramException: The provided RSL 'count'
value is invalid (not an integer or must be greater than 0)
at org.globus.gram.Gram.request(Gram.java:358)
at org.globus.gram.GramJob.request(GramJob.java:262)
at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134)
... 4 more
Cleaning up...
Shutting down service at https://129.114.50.163:45035
snippet of coasters.log:
2009-07-20 17:02:02,344-0500 INFO BlockQueueProcessor
Settings {
slots = 2
workersPerNode = 16
nodeGranularity = 1
allocationStepSize = 0.1
maxNodes = 2
lowOverallocation = 10.0
highOverallocation = 1.0
overallocationDecayFactor = 0.0010
spread = 0.9
reserve = 10.000s
maxtime = 86400
project = TG-CCR080022N
queue = normal
remoteMonitorEnabled = false
}
2009-07-20 17:02:02,345-0500 INFO BlockQueueProcessor Required size:
230400 for 16 jobs
2009-07-20 17:02:02,345-0500 INFO BlockQueueProcessor h: 28800, jj:
14400, x-last: , r: 1
2009-07-20 17:02:02,345-0500 INFO BlockQueueProcessor h: 43200, w: 2,
size: 230400, msz: 230400, w*h: 86400
2009-07-20 17:02:02,355-0500 INFO BlockQueueProcessor Added: 0 - 5
2009-07-20 17:02:02,355-0500 INFO Block Starting block: workers=2,
walltime=43200.000s
2009-07-20 17:02:02,358-0500 INFO BlockTaskSubmitter Queuing block
Block 0720-010553-000000 (2x43200.000s) for submission
2009-07-20 17:02:02,359-0500 INFO BlockQueueProcessor Added 6 jobs to
new blocks
2009-07-20 17:02:02,359-0500 INFO BlockQueueProcessor Plan time: 55
2009-07-20 17:02:02,359-0500 INFO BlockTaskSubmitter Submitting block
Block 0720-010553-000000 (2x43200.000s)
2009-07-20 17:02:02,379-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:cog-1248127320448) setting status to Submitting
2009-07-20 17:02:02,381-0500 INFO Block Block task status changed: Submitting
---end--
with w=2 and workersPerNode=16, count = 2 / 16 = 0 (integer truncation) when a Block is instantiated.
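For what it's worth, that failure mode is consistent with plain integer truncation. A minimal Python sketch of the arithmetic (the real code is Java in BlockQueueProcessor; the function names below are illustrative, not Swift's):

```python
# Sketch of the suspected bug: Java-style truncating division turns
# 2 workers / 16 workersPerNode into a node count of 0, which GRAM
# rejects ("not an integer or must be greater than 0").
def block_node_count(workers, workers_per_node):
    # truncates toward zero for positive operands, like Java int division
    return workers // workers_per_node

def block_node_count_guarded(workers, workers_per_node):
    # ceiling division with a floor of one node
    return max(1, -(-workers // workers_per_node))

print(block_node_count(2, 16))          # 0 -> an invalid RSL 'count'
print(block_node_count_guarded(2, 16))  # 1
```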
sites.xml:
TG-CCR080022N
16
normal
10000
0.32
2
2
4:00:00
86400
/scratch/01035/tg802895/see_runs
Obviously I need to get the right mix of overAllocation parameters,
but an invalid RSL entry should at least be caught.
I'll try to understand BlockQueueProcessor.allocateBlocks better so I
can make at least an intelligent guess at what these values should be.
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From wilde at mcs.anl.gov Mon Jul 20 17:18:31 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 20 Jul 2009 17:18:31 -0500
Subject: [Swift-devel] coasters submit jobs with "count=0" in its globus
RSL params
In-Reply-To: <50b07b4b0907201511p758167f2v3fe24a5dca1da099@mail.gmail.com>
References: <50b07b4b0907201511p758167f2v3fe24a5dca1da099@mail.gmail.com>
Message-ID: <4A64ED37.7010108@mcs.anl.gov>
Sarah, is this the same error you have been getting? (Invalid RSL count
field?)
- Mike
On 7/20/09 5:11 PM, Allan Espinosa wrote:
> session message:
> Caused by:
> Block task failed: Error submitting block task
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Cannot submit job
> at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146)
> at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100)
> at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
> at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50)
> at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:66)
> Caused by: org.globus.gram.GramException: The provided RSL 'count'
> value is invalid (not an integer or must be greater than 0)
> at org.globus.gram.Gram.request(Gram.java:358)
> at org.globus.gram.GramJob.request(GramJob.java:262)
> at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134)
> ... 4 more
>
> Cleaning up...
> Shutting down service at https://129.114.50.163:45035
>
> snippet of coasters.log:
> 2009-07-20 17:02:02,344-0500 INFO BlockQueueProcessor
> Settings {
> slots = 2
> workersPerNode = 16
> nodeGranularity = 1
> allocationStepSize = 0.1
> maxNodes = 2
> lowOverallocation = 10.0
> highOverallocation = 1.0
> overallocationDecayFactor = 0.0010
> spread = 0.9
> reserve = 10.000s
> maxtime = 86400
> project = TG-CCR080022N
> queue = normal
> remoteMonitorEnabled = false
> }
>
> 2009-07-20 17:02:02,345-0500 INFO BlockQueueProcessor Required size:
> 230400 for 16 jobs
> 2009-07-20 17:02:02,345-0500 INFO BlockQueueProcessor h: 28800, jj:
> 14400, x-last: , r: 1
> 2009-07-20 17:02:02,345-0500 INFO BlockQueueProcessor h: 43200, w: 2,
> size: 230400, msz: 230400, w*h: 86400
> 2009-07-20 17:02:02,355-0500 INFO BlockQueueProcessor Added: 0 - 5
> 2009-07-20 17:02:02,355-0500 INFO Block Starting block: workers=2,
> walltime=43200.000s
> 2009-07-20 17:02:02,358-0500 INFO BlockTaskSubmitter Queuing block
> Block 0720-010553-000000 (2x43200.000s) for submission
> 2009-07-20 17:02:02,359-0500 INFO BlockQueueProcessor Added 6 jobs to
> new blocks
> 2009-07-20 17:02:02,359-0500 INFO BlockQueueProcessor Plan time: 55
> 2009-07-20 17:02:02,359-0500 INFO BlockTaskSubmitter Submitting block
> Block 0720-010553-000000 (2x43200.000s)
> 2009-07-20 17:02:02,379-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:cog-1248127320448) setting status to Submitting
> 2009-07-20 17:02:02,381-0500 INFO Block Block task status changed: Submitting
> ---end--
>
> with w=2, count = 2 / 16 = 0 when a Block is instantiated.
>
> sites.xml:
>
>
> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
> TG-CCR080022N
> 16
> normal
> 10000
> 0.32
> 2
> 2
> 4:00:00
> 86400
>
> url="gt2://gatekeeper.ranger.tacc.teragrid.org" />
> /scratch/01035/tg802895/see_runs
>
>
>
> obviously i need to get the right mix of overAllocation parameters.
> but an invalid RSL entry should at least be caught.
>
> I'll try to understand better BlockQueueProcessor.allocateBlocks to
> have at least an intelligent guess on what these values should be.
>
>
From smartin at mcs.anl.gov Tue Jul 21 10:58:15 2009
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Tue, 21 Jul 2009 10:58:15 -0500
Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2
References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov>
Message-ID:
Are there any swift apps that can use Queen Bee? There is a GRAM5
service set up there for testing.
-Stu
Begin forwarded message:
> From: Stuart Martin
> Date: July 21, 2009 10:56:04 AM CDT
> To: gateways at teragrid.org
> Cc: Stuart Martin , Lukasz Lacinski >
> Subject: Fwd: [gram-user] GRAM5 Alpha2
>
> Hi Gateways,
>
> Any gateways that use (or can use) Queen Bee, it would be great if
> you could target this new GRAM5 service that Lukasz deployed. I
> heard from Lukasz that Jim has submitted a gateway user (SAML) job
> and that went through fine and populated the gram audit DB
> correctly. Thanks Jim! It would be nice to have some gateway push
> the service to test scalability.
>
> Let us know if you plan to do this.
>
> Thanks,
> Stu
>
> Begin forwarded message:
>
>> From: Lukasz Lacinski
>> Date: July 21, 2009 1:18:05 AM CDT
>> To: gram-user at lists.globus.org
>> Subject: [gram-user] GRAM5 Alpha2
>>
>> I've installed GRAM5 Alpha2 on Queen Bee.
>>
>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs
>>
>> -seg-module pbs works fine.
>> GRAM audit with PostgreSQL works fine.
>>
>> Can someone submit jobs as a gateway user? I'd like to check if the
>> gateway_user field is written to our audit database.
>>
>> Thanks,
>> Lukasz
>
From tiberius at ci.uchicago.edu Tue Jul 21 11:05:28 2009
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Tue, 21 Jul 2009 11:05:28 -0500
Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2
In-Reply-To:
References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov>
Message-ID:
Hi Stu
I was just installing my application on Queen Bee yesterday, so I could
do some testing for you; just let me know how to take advantage of the
new GRAM5.
Does cogkit/swift already support GRAM5?
Tibi
On Tue, Jul 21, 2009 at 10:58 AM, Stuart Martin wrote:
> Are there any swift apps that can use Queen Bee? There is a GRAM5 service
> set up there for testing.
>
> -Stu
>
> Begin forwarded message:
>
>> From: Stuart Martin
>> Date: July 21, 2009 10:56:04 AM CDT
>> To: gateways at teragrid.org
>> Cc: Stuart Martin , Lukasz Lacinski
>>
>> Subject: Fwd: [gram-user] GRAM5 Alpha2
>>
>> Hi Gateways,
>>
>> Any gateways that use (or can use) Queen Bee, it would be great if you
>> could target this new GRAM5 service that Lukasz deployed. I heard from
>> Lukasz that Jim has submitted a gateway user (SAML) job and that went
>> through fine and populated the gram audit DB correctly. Thanks Jim! It
>> would be nice to have some gateway push the service to test scalability.
>>
>> Let us know if you plan to do this.
>>
>> Thanks,
>> Stu
>>
>> Begin forwarded message:
>>
>>> From: Lukasz Lacinski
>>> Date: July 21, 2009 1:18:05 AM CDT
>>> To: gram-user at lists.globus.org
>>> Subject: [gram-user] GRAM5 Alpha2
>>>
>>> I've installed GRAM5 Alpha2 on Queen Bee.
>>>
>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs
>>>
>>> -seg-module pbs works fine.
>>> GRAM audit with PostgreSQL works fine.
>>>
>>> Can someone submit jobs as a gateway user? I'd like to check if the
>>> gateway_user field is written to our audit database.
>>>
>>> Thanks,
>>> Lukasz
>>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
--
Tiberiu (Tibi) Stef-Praun, PhD
Computational Sciences Researcher
Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/
From hategan at mcs.anl.gov Tue Jul 21 11:20:34 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 21 Jul 2009 11:20:34 -0500
Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2
In-Reply-To:
References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov>
Message-ID: <1248193234.11850.25.camel@localhost>
On Tue, 2009-07-21 at 11:05 -0500, Tiberiu Stef-Praun wrote:
> Hi Stu
>
> I was just installing yesterday my application on queenbee. So I could
> do some testing for you, just let me know how to take advantage of the
> new GRAM5
> Does cogkit/swift already support GRAM5 ?
Should work with it out-of-the-box. But then testing is for verifying
that.
From smartin at mcs.anl.gov Tue Jul 21 11:29:37 2009
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Tue, 21 Jul 2009 11:29:37 -0500
Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2
In-Reply-To: <1248193234.11850.25.camel@localhost>
References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov>
<1248193234.11850.25.camel@localhost>
Message-ID:
Wonderful. Let us know how it goes.
-Stu
On Jul 21, 2009, at Jul 21, 11:20 AM, Mihael Hategan wrote:
> On Tue, 2009-07-21 at 11:05 -0500, Tiberiu Stef-Praun wrote:
>> Hi Stu
>>
>> I was just installing yesterday my application on queenbee. So I
>> could
>> do some testing for you, just let me know how to take advantage of
>> the
>> new GRAM5
>> Does cogkit/swift already support GRAM5 ?
>
> Should work with it out-of-the-box. But then testing is for verifying
> that.
>
>
From tiberius at ci.uchicago.edu Tue Jul 21 11:30:45 2009
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Tue, 21 Jul 2009 11:30:45 -0500
Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2
In-Reply-To:
References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov>
<1248193234.11850.25.camel@localhost>
Message-ID:
So how do I test?
Some instructions would help...
On Tue, Jul 21, 2009 at 11:29 AM, Stuart Martin wrote:
> Wonderful. Let us know how it goes.
>
> -Stu
>
> On Jul 21, 2009, at Jul 21, 11:20 AM, Mihael Hategan wrote:
>
>> On Tue, 2009-07-21 at 11:05 -0500, Tiberiu Stef-Praun wrote:
>>>
>>> Hi Stu
>>>
>>> I was just installing yesterday my application on queenbee. So I could
>>> do some testing for you, just let me know how to take advantage of the
>>> new GRAM5
>>> Does cogkit/swift already support GRAM5 ?
>>
>> Should work with it out-of-the-box. But then testing is for verifying
>> that.
>>
>>
>
>
--
Tiberiu (Tibi) Stef-Praun, PhD
Computational Sciences Researcher
Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/
From hategan at mcs.anl.gov Tue Jul 21 11:39:10 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 21 Jul 2009 11:39:10 -0500
Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2
In-Reply-To:
References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov>
<1248193234.11850.25.camel@localhost>
Message-ID: <1248194350.12672.0.camel@localhost>
1. Install your app on queenbee
2. Find the jobmanager contact for gram5 and put that in your sites.xml,
together with the gridftp contact
3. Run swift
On Tue, 2009-07-21 at 11:30 -0500, Tiberiu Stef-Praun wrote:
> So how do I test ?
> Some instructions would help ...
>
>
> On Tue, Jul 21, 2009 at 11:29 AM, Stuart Martin wrote:
> > Wonderful. Let us know how it goes.
> >
> > -Stu
> >
> > On Jul 21, 2009, at Jul 21, 11:20 AM, Mihael Hategan wrote:
> >
> >> On Tue, 2009-07-21 at 11:05 -0500, Tiberiu Stef-Praun wrote:
> >>>
> >>> Hi Stu
> >>>
> >>> I was just installing yesterday my application on queenbee. So I could
> >>> do some testing for you, just let me know how to take advantage of the
> >>> new GRAM5
> >>> Does cogkit/swift already support GRAM5 ?
> >>
> >> Should work with it out-of-the-box. But then testing is for verifying
> >> that.
> >>
> >>
> >
> >
>
>
>
From aespinosa at cs.uchicago.edu Tue Jul 21 11:49:21 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 21 Jul 2009 11:49:21 -0500
Subject: [Swift-devel] more on # of coasters workers vs actual requested on
ranger
Message-ID: <50b07b4b0907210949w58fc6a59l34b937ff52e6c44b@mail.gmail.com>
According to the gram logs, swift sends requests for blocks of 1, 2, 3
and 4 nodes, but SGE receives requests for four 1-node jobs. This
may be a GRAM2-SGE interaction problem. Is there a way to get the
globus RSL files from swift so I can submit them manually and verify this?
-Allan
coasters.log:
...
...
2009-07-21 10:46:13,788-0500 INFO BlockQueueProcessor Required size:
28800 for 2 jobs
2009-07-21 10:46:13,788-0500 INFO BlockQueueProcessor h: 28800, jj:
14400, x-last: , r: 1
2009-07-21 10:46:13,788-0500 INFO BlockQueueProcessor h: 43200, w:
16, size: 28800, msz: 28800, w*h: 691200
2009-07-21 10:46:13,797-0500 INFO BlockQueueProcessor Added: 0 - 1
2009-07-21 10:46:13,797-0500 INFO Block Starting block: workers=16,
walltime=43200.000s
2009-07-21 10:46:13,859-0500 INFO BlockTaskSubmitter Queuing block
Block 0721-461009-000000 (16x43200.000s) for submission
2009-07-21 10:46:13,859-0500 INFO BlockQueueProcessor Added 2 jobs to
new blocks
2009-07-21 10:46:13,860-0500 INFO BlockQueueProcessor Plan time: 287
2009-07-21 10:46:13,863-0500 INFO BlockTaskSubmitter Submitting block
Block 0721-461009-000000 (16x43200.000s)
2009-07-21 10:46:13,887-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:cog-1248191171562) setting status to Submitting
2009-07-21 10:46:13,889-0500 INFO Block Block task status changed: Submitting
2009-07-21 10:46:15,339-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:cog-1248191171562) setting status to Submitted
2009-07-21 10:46:15,339-0500 INFO Block Block task status changed: Submitted
...
...
...
2009-07-21 10:46:31,545-0500 INFO BlockQueueProcessor Required size:
1152000 for 80 jobs
2009-07-21 10:46:31,545-0500 INFO BlockQueueProcessor h: 28800, jj:
14400, x-last: , r: 31
2009-07-21 10:46:31,545-0500 INFO BlockQueueProcessor h: 43200, w:
48, size: 1152000, msz: 1152000, w*h: 2073600
2009-07-21 10:46:31,545-0500 INFO BlockQueueProcessor Added: 0 - 79
2009-07-21 10:46:31,545-0500 INFO Block Starting block: workers=48,
walltime=43200.000s
2009-07-21 10:46:31,546-0500 INFO BlockTaskSubmitter Queuing block
Block 0721-461009-000001 (48x43200.000s) for submission
2009-07-21 10:46:31,546-0500 INFO BlockQueueProcessor Added 80 jobs
to new blocks
2009-07-21 10:46:31,546-0500 INFO BlockQueueProcessor Plan time: 3
2009-07-21 10:46:31,546-0500 INFO BlockTaskSubmitter Submitting block
Block 0721-461009-000001 (48x43200.000s)
2009-07-21 10:46:31,546-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:cog-1248191171941) setting status to Submitting
2009-07-21 10:46:31,547-0500 INFO Block Block task status changed: Submitting
...
...
2009-07-21 10:46:33,755-0500 INFO BlockQueueProcessor Requeued 133
non-fitting jobs
2009-07-21 10:46:33,755-0500 INFO BlockQueueProcessor Required size:
1915200 for 133 jobs
2009-07-21 10:46:33,755-0500 INFO BlockQueueProcessor h: 28800, jj:
14400, x-last: , r: 4
2009-07-21 10:46:33,755-0500 INFO BlockQueueProcessor h: 43200, w:
64, size: 1915200, msz: 1915200, w*h: 2764800
2009-07-21 10:46:33,756-0500 INFO BlockQueueProcessor Added: 0 - 132
2009-07-21 10:46:33,756-0500 INFO Block Starting block: workers=64,
walltime=43200.000s
2009-07-21 10:46:33,756-0500 INFO BlockTaskSubmitter Queuing block
Block 0721-461009-000002 (64x43200.000s) for submission
2009-07-21 10:46:33,757-0500 INFO BlockQueueProcessor Added 133 jobs
to new blocks
2009-07-21 10:46:33,757-0500 INFO BlockQueueProcessor Plan time: 4
...
...
2009-07-21 10:46:35,980-0500 INFO BlockQueueProcessor Required size:
705600 for 49 jobs
2009-07-21 10:46:35,980-0500 INFO BlockQueueProcessor h: 28800, jj:
14400, x-last: , r: 16
2009-07-21 10:46:35,980-0500 INFO BlockQueueProcessor h: 43200, w:
32, size: 705600, msz: 705600, w*h: 1382400
2009-07-21 10:46:35,980-0500 INFO BlockQueueProcessor Added: 0 - 48
2009-07-21 10:46:35,980-0500 INFO Block Starting block: workers=32,
walltime=43200.000s
2009-07-21 10:46:35,981-0500 INFO BlockTaskSubmitter Queuing block
Block 0721-461009-000003 (32x43200.000s) for submission
2009-07-21 10:46:35,981-0500 INFO BlockQueueProcessor Added 49 jobs
to new blocks
2009-07-21 10:46:35,981-0500 INFO BlockQueueProcessor Plan time: 4
2009-07-21 10:46:35,981-0500 INFO BlockTaskSubmitter Submitting block
Block 0721-461009-000003 (32x43200.000s)
2009-07-21 10:46:35,981-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:cog-1248191172858) setting status to Submitting
2009-07-21 10:46:35,982-0500 INFO Block Block task status changed: Submitting
...
...
gram log snippets:
log1: (16 cpus)
...
7/21 10:46:14 Pre-parsed RSL string: &( rsl_substitution =
(GLOBUSRUN_GASS_URL "https://129.114.50.163:52077") )( queue =
"normal" )( project = "TG-CCR080022N"
)( stdout = $(GLOBUSRUN_GASS_URL) #
"/dev/stdout-urn:cog-1248191171562" )( arguments =
"/share/home/01035/tg802895/.globus/coasters/cscript26994.pl"
"http://1
29.114.50.163:52072" "0721-461009-000000" "16" )( count = "1" )(
executable = "/usr/bin/perl" )( stderr = $(GLOBUSRUN_GASS_URL) #
"/dev/stderr-urn:cog-12481911
71562" )( maxwalltime = "720" )
7/21 10:46:14
...
log2: (48 cpus)
...
7/21 10:46:32 Pre-parsed RSL string: &( rsl_substitution =
(GLOBUSRUN_GASS_URL "https://129.114.50.163:52077") )( queue =
"normal" )( project = "TG-CCR080022N"
)( stdout = $(GLOBUSRUN_GASS_URL) #
"/dev/stdout-urn:cog-1248191171941" )( arguments =
"/share/home/01035/tg802895/.globus/coasters/cscript26994.pl"
"http://1
29.114.50.163:52072" "0721-461009-000001" "16" )( count = "3" )(
executable = "/usr/bin/perl" )( stderr = $(GLOBUSRUN_GASS_URL) #
"/dev/stderr-urn:cog-12481911
71941" )( maxwalltime = "720" )
7/21 10:46:32
...
log3: (64 cpus)
...
7/21 10:46:34 Pre-parsed RSL string: &( rsl_substitution =
(GLOBUSRUN_GASS_URL "https://129.114.50.163:52077") )( queue =
"normal" )( project = "TG-CCR080022N"
)( stdout = $(GLOBUSRUN_GASS_URL) #
"/dev/stdout-urn:cog-1248191172533" )( arguments =
"/share/home/01035/tg802895/.globus/coasters/cscript26994.pl"
"http://1
29.114.50.163:52072" "0721-461009-000002" "16" )( count = "4" )(
executable = "/usr/bin/perl" )( stderr = $(GLOBUSRUN_GASS_URL) #
"/dev/stderr-urn:cog-12481911
72533" )( maxwalltime = "720" )
7/21 10:46:34
...
log4: (32 cpus)
...
7/21 10:46:36 Pre-parsed RSL string: &( rsl_substitution =
(GLOBUSRUN_GASS_URL "https://129.114.50.163:52077") )( queue =
"normal" )( project = "TG-CCR080022N"
)( stdout = $(GLOBUSRUN_GASS_URL) #
"/dev/stdout-urn:cog-1248191172858" )( arguments =
"/share/home/01035/tg802895/.globus/coasters/cscript26994.pl"
"http://1
29.114.50.163:52072" "0721-461009-000003" "16" )( count = "2" )(
executable = "/usr/bin/perl" )( stderr = $(GLOBUSRUN_GASS_URL) #
"/dev/stderr-urn:cog-12481911
72858" )( maxwalltime = "720" )
7/21 10:46:36
...
what was actually requested:
login4$ showq -u
ACTIVE JOBS--------------------------
JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME
================================================================================
0 active jobs : 0 of 3828 hosts ( 0.00 %)
WAITING JOBS------------------------
JOBID JOBNAME USERNAME STATE CORE WCLIMIT QUEUETIME
================================================================================
873041 data tg802895 Waiting 16 12:00:00 Tue Jul 21 10:46:17
873043 data tg802895 Waiting 16 12:00:00 Tue Jul 21 10:46:33
873044 data tg802895 Waiting 16 12:00:00 Tue Jul 21 10:46:36
873045 data tg802895 Waiting 16 12:00:00 Tue Jul 21 10:46:38
WAITING JOBS WITH JOB DEPENDENCIES---
JOBID JOBNAME USERNAME STATE CORE WCLIMIT QUEUETIME
================================================================================
UNSCHEDULED JOBS---------------------
JOBID JOBNAME USERNAME STATE CORE WCLIMIT QUEUETIME
================================================================================
Total jobs: 4 Active Jobs: 0 Waiting Jobs: 4 Dep/Unsched Jobs: 0
login4$
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From hategan at mcs.anl.gov Tue Jul 21 11:59:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 21 Jul 2009 11:59:58 -0500
Subject: [Swift-devel] more on # of coasters workers vs actual
requested on ranger
In-Reply-To: <50b07b4b0907210949w58fc6a59l34b937ff52e6c44b@mail.gmail.com>
References: <50b07b4b0907210949w58fc6a59l34b937ff52e6c44b@mail.gmail.com>
Message-ID: <1248195598.12972.6.camel@localhost>
On Tue, 2009-07-21 at 11:49 -0500, Allan Espinosa wrote:
> According to the gram logs, swift sends requests for blocks of 1, 2, 3
> and 4 nodes, but SGE receives requests for four 1-node jobs. This
> may be a GRAM2-SGE interaction problem. Is there a way to get the
> globus RSL files from swift so I can submit them manually and verify this?
In cog/modules/coaster/resources/log4.properties add:
log4j.logger.org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler=DEBUG
Then re-compile.
But I don't think you need to go that far. Write your own RSL. In
particular I'd suggest trying with both jobType=multiple and without.
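A hand-rolled RSL for that experiment can be generated with a few lines; a sketch (the executable, walltime, and contact string are placeholders for manual testing, not values taken from Swift):

```python
# Build two hand-submittable RSL strings, with and without jobType=multiple,
# to compare how the GRAM2 jobmanager hands them to SGE. Submit each one
# manually, e.g.:  globusrun -r <gatekeeper-contact> '<rsl string>'
def make_rsl(count, job_type=None):
    clauses = [
        '( executable = "/bin/hostname" )',
        '( count = "%d" )' % count,
        '( maxwalltime = "10" )',
    ]
    if job_type is not None:
        clauses.append('( jobType = "%s" )' % job_type)
    return "&" + "".join(clauses)

print(make_rsl(4))              # plain count, default jobType
print(make_rsl(4, "multiple"))  # same request with jobType=multiple
```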
From aespinosa at cs.uchicago.edu Tue Jul 21 13:13:12 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 21 Jul 2009 13:13:12 -0500
Subject: [Swift-devel] more on # of coasters workers vs actual requested
on ranger
In-Reply-To: <1248195598.12972.6.camel@localhost>
References: <50b07b4b0907210949w58fc6a59l34b937ff52e6c44b@mail.gmail.com>
<1248195598.12972.6.camel@localhost>
Message-ID: <50b07b4b0907211113l4e554679j3de4743a7030fe85@mail.gmail.com>
Aha.
On Ranger the count clause refers to the number of CPUs, so when
coasters requests count=4 it only needs 1 node. If we want
workersPerNode=16 then we should manually specify host_count=4
instead of count=4, or just use workersPerNode=1.
I'll do more RSL exploration and probably play with the coasters'
generation of GRAM2 requests.
-Allan
2009/7/21 Mihael Hategan :
> On Tue, 2009-07-21 at 11:49 -0500, Allan Espinosa wrote:
> >> According to the gram logs, swift sends requests for blocks of 1, 2, 3
> >> and 4 nodes, but SGE receives requests for four 1-node jobs. This
> >> may be a GRAM2-SGE interaction problem. Is there a way to get the
> >> globus RSL files from swift so I can submit them manually and verify this?
>
> In cog/modules/coaster/resources/log4.properties add:
> log4j.logger.org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler=DEBUG
>
> Then re-compile.
>
> But I don't think you need to go that far. Write your own RSL. In
> particular I'd suggest trying with both jobType=multiple and without.
>
>
>
>
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From hategan at mcs.anl.gov Tue Jul 21 13:20:32 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 21 Jul 2009 13:20:32 -0500
Subject: [Swift-devel] more on # of coasters workers vs actual
requested on ranger
In-Reply-To: <50b07b4b0907211113l4e554679j3de4743a7030fe85@mail.gmail.com>
References: <50b07b4b0907210949w58fc6a59l34b937ff52e6c44b@mail.gmail.com>
<1248195598.12972.6.camel@localhost>
<50b07b4b0907211113l4e554679j3de4743a7030fe85@mail.gmail.com>
Message-ID: <1248200432.16593.6.camel@localhost>
On Tue, 2009-07-21 at 13:13 -0500, Allan Espinosa wrote:
> aha.
>
> On Ranger the count clause refers to the number of CPUs, so when
> coasters requests count=4 it only needs 1 node. If we want
> to do workersPerNode=16 then we should manually specify host_count=4
> instead of count=4, or just use workersPerNode=1.
Ah, right. I remember this funny problem.
Can you find out how well this is supported in general? The gram docs
are a bit vague:
(hostCount=value)
Only applies to clusters of SMP computers, such as newer IBM SP
systems. Defines the number of nodes ("pizza boxes") to
distribute the "count" processes across.
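Under that reading, a hedged RSL sketch for, say, 64 processes spread over 4 nodes (16 cores each, as on Ranger) would add hostCount alongside count; the contact string below is a placeholder and the attribute name follows the docs quoted above:

```
# Hypothetical example: count = total processes, hostCount = nodes to use.
globusrun -r gatekeeper.example.org:2119/jobmanager-sge \
    '&(executable=/bin/sleep)(arguments=300)(count=64)(hostCount=4)'
```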
>
> i'll do more rsl exploration and probably play with the coaster's
> generation of GRAM2 requests.
>
> -Allan
>
> 2009/7/21 Mihael Hategan :
> > On Tue, 2009-07-21 at 11:49 -0500, Allan Espinosa wrote:
> >> According to the gram logs, swift sends requests for blocks of 1, 2, 3
> >> and 4 nodes but SGE receives requests for four 1-node jobs. This
> >> may be a GRAM2-SGE interaction problem. Is there a way to get the
> >> globus RSL files from swift so I can submit manually and verify this?
> >
> > In cog/modules/coaster/resources/log4.properties add:
> > log4j.logger.org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler=DEBUG
> >
> > Then re-compile.
> >
> > But I don't think you need to go that far. Write your own RSL. In
> > particular I'd suggest trying with both jobType=multiple and without.
> >
> >
> >
> >
>
>
>
From wilde at mcs.anl.gov Tue Jul 21 17:23:20 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 21 Jul 2009 17:23:20 -0500
Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2
In-Reply-To:
References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov>
Message-ID: <4A663FD8.3050909@mcs.anl.gov>
Yes, there are a few we can run on QueenBee.
Can try to test next week.
Allan, we can test SEE/AMPL, OOPS, and PTMap there.
- Mike
On 7/21/09 10:58 AM, Stuart Martin wrote:
> Are there any swift apps that can use Queen Bee? There is a GRAM5
> service setup there for testing.
>
> -Stu
>
> Begin forwarded message:
>
>> From: Stuart Martin
>> Date: July 21, 2009 10:56:04 AM CDT
>> To: gateways at teragrid.org
>> Cc: Stuart Martin , Lukasz Lacinski
>>
>> Subject: Fwd: [gram-user] GRAM5 Alpha2
>>
>> Hi Gateways,
>>
>> Any gateways that use (or can use) Queen Bee, it would be great if you
>> could target this new GRAM5 service that Lukasz deployed. I heard
>> from Lukasz that Jim has submitted a gateway user (SAML) job and that
>> went through fine and populated the GRAM audit DB correctly. Thanks
>> Jim! It would be nice to have some gateway push the service to test
>> scalability.
>>
>> Let us know if you plan to do this.
>>
>> Thanks,
>> Stu
>>
>> Begin forwarded message:
>>
>>> From: Lukasz Lacinski
>>> Date: July 21, 2009 1:18:05 AM CDT
>>> To: gram-user at lists.globus.org
>>> Subject: [gram-user] GRAM5 Alpha2
>>>
>>> I've installed GRAM5 Alpha2 on Queen Bee.
>>>
>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs
>>>
>>> -seg-module pbs works fine.
>>> GRAM audit with PostgreSQL works fine.
>>>
>>> Can someone submit jobs as a gateway user? I'd like to check if the
>>> gateway_user field is written to our audit database.
>>>
>>> Thanks,
>>> Lukasz
>>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From aespinosa at cs.uchicago.edu Thu Jul 23 11:40:15 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 23 Jul 2009 11:40:15 -0500
Subject: [Swift-devel] coaster workers not receiving enough jobs
Message-ID: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com>
I tried 0660-many.swift with 200 5-minute sleep jobs in local:local
mode (since the queues on ranger and teraport take a while to finish).
The session spawned 192 workers. Swift reports at most 36 active
processes at a time (which it finished successfully). After that, the
workers hit idle-time exceptions. Logs and related files are in
~aespinosa/workflows/coaster_debug/run1/
sites.xml:
/home/aespinosa/workflows/coaster_debug/workdir
10000
1.98
1
00:05:00
3600
swift session:
Swift svn swift-r3011 cog-r2439
RunID: locallog
Progress:
Progress: Selecting site:198 Initializing site shared directory:1 Stage in:1
Progress: Selecting site:1 Submitting:198 Submitted:1
Progress: Selecting site:1 Submitted:198 Active:1
Progress: Selecting site:1 Submitted:192 Active:7
Progress: Selecting site:1 Submitted:188 Active:11
Progress: Selecting site:1 Submitted:181 Active:18
Progress: Selecting site:1 Submitted:178 Active:21
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:35 Checking status:1
Progress: Submitted:156 Active:35 Checking status:1 Finished successfully:8
Progress: Submitted:149 Active:34 Checking status:1 Finished successfully:16
Progress: Submitted:144 Active:35 Checking status:1 Finished successfully:20
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:133 Active:31 Finished successfully:36
Failed to transfer wrapper log from 066-many-locallog/info/0 on localhost
Failed to transfer wrapper log from 066-many-locallog/info/l on localhost
Failed to transfer wrapper log from 066-many-locallog/info/k on localhost
Failed to transfer wrapper log from 066-many-locallog/info/n on localhost
Failed to transfer wrapper log from 066-many-locallog/info/o on localhost
Failed to transfer wrapper log from 066-many-locallog/info/q on localhost
Failed to transfer wrapper log from 066-many-locallog/info/c on localhost
Failed to transfer wrapper log from 066-many-locallog/info/m on localhost
Failed to transfer wrapper log from 066-many-locallog/info/i on localhost
Failed to transfer wrapper log from 066-many-locallog/info/p on localhost
Failed to transfer wrapper log from 066-many-locallog/info/a on localhost
Progress: Stage in:11 Submitting:34 Submitted:113 Active:6
Finished successfully:36
Progress: Submitted:157 Active:7 Finished successfully:36
Failed to transfer wrapper log from 066-many-locallog/info/t on localhost
Failed to transfer wrapper log from 066-many-locallog/info/u on localhost
Failed to transfer wrapper log from 066-many-locallog/info/v on localhost
Failed to transfer wrapper log from 066-many-locallog/info/x on localhost
Failed to transfer wrapper log from 066-many-locallog/info/r on localhost
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
...
... (not yet finished)
$grep JOB_SUBMISSION coasters.log | grep Active | grep workerid | cat -n | tail
65 2009-07-23 11:08:10,065-0500 DEBUG TaskImpl
Task(type=JOB_SUBMISSION,
identity=urn:1248364974288-1248364979260-1248364979261) setting status
to Active workerid=000055
66 2009-07-23 11:08:10,090-0500 DEBUG TaskImpl
Task(type=JOB_SUBMISSION,
identity=urn:1248364974280-1248364979248-1248364979249) setting status
to Active workerid=000051
$ grep -a SUBMITJOB worker-0723-021156-00000* | grep Cmd | cat -n | tail
61 worker-0723-021156-000001.log:1248365290 000054 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
62 worker-0723-021156-000001.log:1248365290 000050 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
63 worker-0723-021156-000001.log:1248365290 000053 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
64 worker-0723-021156-000001.log:1248365290 000052 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
65 worker-0723-021156-000001.log:1248365290 000051 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
66 worker-0723-021156-000001.log:1248365290 000055 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
It corresponds (more or less) with the swift session, since
we had 30+ completed jobs.
Some lines in coasters.log I find interesting:
2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1248364974290-1248364979263-1248364979264) setting status
to Submitted
2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1248364974290-1248364979263-1248364979264) setting status
to Active
2009-07-23 11:12:06,065-0500 INFO Command Sending Command(106,
JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
2009-07-23 11:12:06,065-0500 INFO Command Command(106, JOBSTATUS)
CMD: Command(106, JOBSTATUS)
2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1248364974290-1248364979263-1248364979264) setting status
to Failed Block task failed: 0723-021156-000001 Block task ended prematurely
Statement unlikely to be reached at
/home/aespinosa/.globus/coasters/cscript15423.pl line 580.
(Maybe you meant system() when you said exec()?)
2009-07-23 11:12:06,065-0500 INFO Command Sending Command(107,
JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
2009-07-23 11:12:06,065-0500 INFO Command Command(107, JOBSTATUS)
CMD: Command(107, JOBSTATUS)
-Allan
From hategan at mcs.anl.gov Thu Jul 23 11:49:33 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 23 Jul 2009 11:49:33 -0500
Subject: [Swift-devel] coaster workers not receiving enough jobs
In-Reply-To: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com>
References: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com>
Message-ID: <1248367773.25313.5.camel@localhost>
On Thu, 2009-07-23 at 11:40 -0500, Allan Espinosa wrote:
> Some lines in coasters.log i find intersting:
> 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1248364974290-1248364979263-1248364979264) setting status
> to Failed Block task failed: 0723-021156-000001 Block task ended prematurely
>
> Statement unlikely to be reached at
> /home/aespinosa/.globus/coasters/cscript15423.pl line 580.
> (Maybe you meant system() when you said exec()?)
>
I think perl is being extra-cautious there. The sequence of commands is
the following:
exec { $executable } @JOBARGS;
print $WR "Could not execute $executable: $!\n";
die "Could not execute $executable: $!";
If exec succeeds, the print statement is indeed unreachable. However, it
is there to deal with the case when exec doesn't succeed.
There are ways to write it to avoid that warning, but that warning isn't
indicative of an actual problem here.
From aespinosa at cs.uchicago.edu Thu Jul 23 12:08:17 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 23 Jul 2009 12:08:17 -0500
Subject: [Swift-devel] coaster workers not receiving enough jobs
In-Reply-To: <1248367773.25313.5.camel@localhost>
References: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com>
<1248367773.25313.5.camel@localhost>
Message-ID: <50b07b4b0907231008qb309b6eu162eba00cf389663@mail.gmail.com>
Ah, right.
Reading more into the logs and code, my guess is that there are not
enough Cpu.pull() calls to get jobs from the coaster service:
$ grep pull coasters.log | grep -v Later | cat -n
62 2009-07-23 11:08:09,813-0500 INFO Cpu 0723-021156-000001:51 pull
63 2009-07-23 11:08:09,814-0500 INFO Cpu 0723-021156-000001:52 pull
64 2009-07-23 11:08:09,841-0500 INFO Cpu 0723-021156-000001:53 pull
65 2009-07-23 11:08:09,918-0500 INFO Cpu 0723-021156-000001:54 pull
66 2009-07-23 11:08:09,968-0500 INFO Cpu 0723-021156-000001:55 pull
67 2009-07-23 11:12:06,079-0500 INFO Cpu 0723-021156-000001:56 pull
These pull() calls get invoked on the bunch of CPUs in the pullthread,
correct? I'll read up on pullthreads and try to figure things out.
-Allan
2009/7/23 Mihael Hategan :
> On Thu, 2009-07-23 at 11:40 -0500, Allan Espinosa wrote:
>
>
> There are ways to write it to avoid that warning, but that warning isn't
> indicative of an actual problem here.
>
From hategan at mcs.anl.gov Thu Jul 23 12:35:44 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 23 Jul 2009 12:35:44 -0500
Subject: [Swift-devel] coaster workers not receiving enough jobs
In-Reply-To: <50b07b4b0907231008qb309b6eu162eba00cf389663@mail.gmail.com>
References: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com>
<1248367773.25313.5.camel@localhost>
<50b07b4b0907231008qb309b6eu162eba00cf389663@mail.gmail.com>
Message-ID: <1248370544.27943.1.camel@localhost>
On Thu, 2009-07-23 at 12:08 -0500, Allan Espinosa wrote:
> Ah right.
>
> Reading more into logs and code, my guess is that there is not enough
> Cpu.pull() calls to get jobs from the coaster service:
>
> $ grep pull coasters.log | grep -v Later | cat -n
> 62 2009-07-23 11:08:09,813-0500 INFO Cpu 0723-021156-000001:51 pull
> 63 2009-07-23 11:08:09,814-0500 INFO Cpu 0723-021156-000001:52 pull
> 64 2009-07-23 11:08:09,841-0500 INFO Cpu 0723-021156-000001:53 pull
> 65 2009-07-23 11:08:09,918-0500 INFO Cpu 0723-021156-000001:54 pull
> 66 2009-07-23 11:08:09,968-0500 INFO Cpu 0723-021156-000001:55 pull
> 67 2009-07-23 11:12:06,079-0500 INFO Cpu 0723-021156-000001:56 pull
>
> These pull() calls get invoked in the bunch of cpus in the pullthread
> correct? I'll read up on pullthreads
I don't think there's some official kind of "pullthread". It's a
separate thread I wrote in order to allow waiting and avoid deadlocks.
> and try to figure things out.
>
> -Allan
>
> 2009/7/23 Mihael Hategan :
> > On Thu, 2009-07-23 at 11:40 -0500, Allan Espinosa wrote:
> >
> >
> > There are ways to write it to avoid that warning, but that warning isn't
> > indicative of an actual problem here.
> >
From andric at uchicago.edu Thu Jul 23 13:13:40 2009
From: andric at uchicago.edu (Michael Andric)
Date: Thu, 23 Jul 2009 13:13:40 -0500
Subject: [Swift-devel] errors from HNL machines/swift
Message-ID:
Hi Support, Swift dev, anyone else reading,
I keep getting this crash on swift jobs submitted from HNL machines (both
andrew.bsd.uchicago.edu and gwynn.bsd.uchicago.edu). These happen for
different workflows, involving different processes. I am totally in the
dark as to what this error is referring to as well as to what may be causing
it. This crash has occurred on workflows that have just gone 'Active' as
well as on workflows that were running for hours before crashing.
Below is the error message. The log file is too big to attach but can be
found here:
/gpfs/pads/fmri/cnari/swift/projects/andric/peakfit_pilots/PK2/turnpointAnalysis/tpChiSqTests-20090723-1113-na2cuboc.log
from one of the HNL machines (e.g., gwynn.bsd.uchicago.edu)
Any insight is hugely appreciated. Like I said, I don't even know what
to debug because I don't know what the error is referring to.
Michael
Progress: Submitted:11 Active:1
Progress: Active:10 Stage out:2
#
# An unexpected error has been detected by HotSpot Virtual Machine:
#
# SIGBUS (0x7) at pc=0xb75b9a62, pid=32310, tid=2949090208
#
# Java VM: Java HotSpot(TM) Client VM (1.5.0_06-b05 mixed mode, sharing)
# Problematic frame:
# C [libzip.so+0xfa62]
#
# An error report file with more information is saved as hs_err_pid32310.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#
/gpfs/pads/fmri/apps/swift/bin/swift: line 100: 32310 Aborted
java -Xmx2048M
-Djava.endorsed.dirs=/gpfs/pads/fmri/apps/swift/bin/../lib/endorsed
-DUID=1309 -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DGLOBUS_HOSTNAME=
andrew.bsd.uchicago.edu -DCOG_INSTALL_PATH=/gpfs/pads/fmri/apps/swift/bin/..
-Dvds.home=/gpfs/pads/fmri/apps/swift/bin/..
-Dswift.home=/gpfs/pads/fmri/apps/swift/bin/..
-Djava.security.egd=file:///dev/urandom -Xmx1024m -classpath
/gpfs/pads/fmri/apps/swift/bin/../etc:/gpfs/pads/fmri/apps/swift/bin/../libexec:/gpfs/pads/fmri/apps/swift/bin/../lib/addressing-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/ant.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/antlr-2.7.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/backport-util-concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/castor-0.9.6.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/coaster-bootstrap.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-abstraction-common-2.3.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-grapheditor-0.47.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-jglobus-dev-080222.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-karajan-0.36-dev.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-clref-gt4_0_0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-coaster-0.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-dcache-0.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt2-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-local-2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-localscheduler-0.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-ssh-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-webdav-2.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-resources-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-swift-svn.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-trap-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-util-0.92.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commonj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-beanutils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-collections-3.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-digester.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-discovery.jar:
/gpfs/pads/fmri/apps/swift/bin/../lib/commons-httpclient.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-logging-1.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix32.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix-asn1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_mds_aggregator_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rft_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-client.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-utils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gvds.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-common-0.2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-core-0.2.2-patched.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-regexp-1.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-slide-webdavlib-2.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jaxrpc.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jce-jdk13-131.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jgss.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jsr173_1.0_api.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jug-lgpl-2.0.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/junit.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/log4j-1.2.8.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-common.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-factory.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-java.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-resources.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/opensaml.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/puretls.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/resolver.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/
saaj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/stringtemplate.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/vdldefinitions.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsdl4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_index_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_usefulrp_schema_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_provider_jce.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_tools.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wss4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xalan.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean_xpath.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xercesImpl.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xml-apis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xmlsec.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xpp3-1.1.3.4d_b4_min.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xstream-1.1.1-patched.jar:
org.griphyn.vdl.karajan.Loader 'tpChiSqTests.swift' '-sites.file'
'/gpfs/pads/fmri/cnari_svn/config/coaster_ranger.xml' '-user=andric'
From hategan at mcs.anl.gov Thu Jul 23 13:20:16 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 23 Jul 2009 13:20:16 -0500
Subject: [Swift-devel] errors from HNL machines/swift
In-Reply-To:
References:
Message-ID: <1248373216.28628.1.camel@localhost>
There's a JVM dump, namely hs_err_pid32310.log. I'd like to see that.
Otherwise it seems related to this:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6390352
You could try a newer JVM and see if the problem persists.
On Thu, 2009-07-23 at 13:13 -0500, Michael Andric wrote:
> HI Support, Swift dev, anyone else reading,
>
> I keep getting this crash on swift jobs submitted from HNL machines
> (both andrew.bsd.uchicago.edu and gwynn.bsd.uchicago.edu). These
> happen for different workflows, involving different processes. I am
> totally in the dark as to what this error is referring to as well as
> to what may be causing it. This crash has occurred on workflows that
> have just gone 'Active' as well as on workflows that were running for
> hours before crashing.
>
>
> Below is the error message. The log file is too big to attach but can
> be found here:
> /gpfs/pads/fmri/cnari/swift/projects/andric/peakfit_pilots/PK2/turnpointAnalysis/tpChiSqTests-20090723-1113-na2cuboc.log
> from one of the HNL machines (e.g., gwynn.bsd.uchicago.edu)
>
>
> Any insight is hugely appreciated - like i said, i don't even know
> what to debug b/c i don't know what the error is referring to.
> Michael
>
>
>
>
>
>
>
> Progress: Submitted:11 Active:1
> Progress: Active:10 Stage out:2
> #
> # An unexpected error has been detected by HotSpot Virtual Machine:
> #
> # SIGBUS (0x7) at pc=0xb75b9a62, pid=32310, tid=2949090208
> #
> # Java VM: Java HotSpot(TM) Client VM (1.5.0_06-b05 mixed mode,
> sharing)
> # Problematic frame:
> # C [libzip.so+0xfa62]
> #
> # An error report file with more information is saved as
> hs_err_pid32310.log
> #
> # If you would like to submit a bug report, please visit:
> # http://java.sun.com/webapps/bugreport/crash.jsp
> #
> /gpfs/pads/fmri/apps/swift/bin/swift: line 100: 32310 Aborted
> java -Xmx2048M
> -Djava.endorsed.dirs=/gpfs/pads/fmri/apps/swift/bin/../lib/endorsed
> -DUID=1309 -DGLOBUS_TCP_PORT_RANGE=50000,51000
> -DGLOBUS_HOSTNAME=andrew.bsd.uchicago.edu -DCOG_INSTALL_PATH=/gpfs/pads/fmri/apps/swift/bin/.. -Dvds.home=/gpfs/pads/fmri/apps/swift/bin/.. -Dswift.home=/gpfs/pads/fmri/apps/swift/bin/.. -Djava.security.egd=file:///dev/urandom -Xmx1024m -classpath /gpfs/pads/fmri/apps/swift/bin/../etc:/gpfs/pads/fmri/apps/swift/bin/../libexec:/gpfs/pads/fmri/apps/swift/bin/../lib/addressing-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/ant.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/antlr-2.7.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/backport-util-concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/castor-0.9.6.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/coaster-bootstrap.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-abstraction-common-2.3.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-grapheditor-0.47.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-jglobus-dev-080222.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-karajan-0.36-dev.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-clref-gt4_0_0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-coaster-0.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-dcache-0.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt2-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-local-2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-localscheduler-0.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-ssh-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-webdav-2.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-resources-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-swift-svn.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-trap-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-util-0.92.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commonj
.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-beanutils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-collections-3.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-digester.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-discovery.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-httpclient.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-logging-1.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix32.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix-asn1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_mds_aggregator_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rft_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-client.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-utils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gvds.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-common-0.2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-core-0.2.2-patched.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-regexp-1.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-slide-webdavlib-2.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jaxrpc.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jce-jdk13-131.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jgss.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jsr173_1.0_api.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jug-lgpl-2.0.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/junit.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/log4j-1.2.8.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-common.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-factory.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-java.jar
:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-resources.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/opensaml.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/puretls.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/resolver.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/saaj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/stringtemplate.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/vdldefinitions.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsdl4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_index_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_usefulrp_schema_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_provider_jce.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_tools.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wss4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xalan.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean_xpath.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xercesImpl.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xml-apis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xmlsec.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xpp3-1.1.3.4d_b4_min.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xstream-1.1.1-patched.jar: org.griphyn.vdl.karajan.Loader 'tpChiSqTests.swift' '-sites.file' '/gpfs/pads/fmri/cnari_svn/config/coaster_ranger.xml' '-user=andric'
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From andric at uchicago.edu Thu Jul 23 15:23:03 2009
From: andric at uchicago.edu (Michael Andric)
Date: Thu, 23 Jul 2009 15:23:03 -0500
Subject: [Swift-devel] errors from HNL machines/swift
In-Reply-To: <1248373216.28628.1.camel@localhost>
References:
<1248373216.28628.1.camel@localhost>
Message-ID:
there are a couple here: andrew.bsd.uchicago.edu:/tmp/hs*.log
On Thu, Jul 23, 2009 at 1:20 PM, Mihael Hategan wrote:
> There's a JVM dump, namely hs_err_pid32310.log. I'd like to see that.
>
> Otherwise it seems related to this:
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6390352
>
> You could try a newer JVM and see if the problem persists.
>
> On Thu, 2009-07-23 at 13:13 -0500, Michael Andric wrote:
> > HI Support, Swift dev, anyone else reading,
> >
> > I keep getting this crash on swift jobs submitted from HNL machines
> > (both andrew.bsd.uchicago.edu and gwynn.bsd.uchicago.edu). These
> > happen for different workflows, involving different processes. I am
> > totally in the dark as to what this error is referring to as well as
> > to what may be causing it. This crash has occurred on workflows that
> > have just gone 'Active' as well as on workflows that were running for
> > hours before crashing.
> >
> >
> > Below is the error message. The log file is too big to attach but can
> > be found here:
> >
> /gpfs/pads/fmri/cnari/swift/projects/andric/peakfit_pilots/PK2/turnpointAnalysis/tpChiSqTests-20090723-1113-na2cuboc.log
> > from one of the HNL machines (e.g., gwynn.bsd.uchicago.edu)
> >
> >
> > Any insight is hugely appreciated - like i said, i don't even know
> > what to debug b/c i don't know what the error is referring to.
> > Michael
> >
> >
> >
> >
> >
> >
> >
> > Progress: Submitted:11 Active:1
> > Progress: Active:10 Stage out:2
> > #
> > # An unexpected error has been detected by HotSpot Virtual Machine:
> > #
> > # SIGBUS (0x7) at pc=0xb75b9a62, pid=32310, tid=2949090208
> > #
> > # Java VM: Java HotSpot(TM) Client VM (1.5.0_06-b05 mixed mode,
> > sharing)
> > # Problematic frame:
> > # C [libzip.so+0xfa62]
> > #
> > # An error report file with more information is saved as
> > hs_err_pid32310.log
> > #
> > # If you would like to submit a bug report, please visit:
> > # http://java.sun.com/webapps/bugreport/crash.jsp
> > #
> > /gpfs/pads/fmri/apps/swift/bin/swift: line 100: 32310 Aborted
> > java -Xmx2048M
> > -Djava.endorsed.dirs=/gpfs/pads/fmri/apps/swift/bin/../lib/endorsed
> > -DUID=1309 -DGLOBUS_TCP_PORT_RANGE=50000,51000
> > -DGLOBUS_HOSTNAME=andrew.bsd.uchicago.edu -DCOG_INSTALL_PATH=/gpfs/pads/fmri/apps/swift/bin/..
> -Dvds.home=/gpfs/pads/fmri/apps/swift/bin/..
> -Dswift.home=/gpfs/pads/fmri/apps/swift/bin/..
> -Djava.security.egd=file:///dev/urandom -Xmx1024m -classpath
> /gpfs/pads/fmri/apps/swift/bin/../etc:/gpfs/pads/fmri/apps/swift/bin/../libexec:/gpfs/pads/fmri/apps/swift/bin/../lib/addressing-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/ant.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/antlr-2.7.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/backport-util-concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/castor-0.9.6.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/coaster-bootstrap.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-abstraction-common-2.3.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-grapheditor-0.47.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-jglobus-dev-080222.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-karajan-0.36-dev.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-clref-gt4_0_0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-coaster-0.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-dcache-0.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt2-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-local-2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-localscheduler-0.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-ssh-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-webdav-2.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-resources-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-swift-svn.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-trap-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-util-0.92.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commonj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-beanutils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-collections-3.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-digester.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-discovery.ja
r:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-httpclient.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-logging-1.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix32.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix-asn1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_mds_aggregator_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rft_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-client.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-utils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gvds.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-common-0.2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-core-0.2.2-patched.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-regexp-1.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-slide-webdavlib-2.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jaxrpc.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jce-jdk13-131.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jgss.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jsr173_1.0_api.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jug-lgpl-2.0.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/junit.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/log4j-1.2.8.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-common.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-factory.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-java.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-resources.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/opensaml.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/puretls.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/resolver.jar:/gpfs/pads/fmri/apps/swift/bin/../li
b/saaj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/stringtemplate.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/vdldefinitions.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsdl4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_index_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_usefulrp_schema_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_provider_jce.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_tools.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wss4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xalan.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean_xpath.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xercesImpl.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xml-apis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xmlsec.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xpp3-1.1.3.4d_b4_min.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xstream-1.1.1-patched.jar:
> org.griphyn.vdl.karajan.Loader 'tpChiSqTests.swift' '-sites.file'
> '/gpfs/pads/fmri/cnari_svn/config/coaster_ranger.xml' '-user=andric'
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From hategan at mcs.anl.gov Thu Jul 23 15:33:23 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 23 Jul 2009 15:33:23 -0500
Subject: [Swift-devel] errors from HNL machines/swift
In-Reply-To:
References:
<1248373216.28628.1.camel@localhost>
Message-ID: <1248381203.32020.0.camel@localhost>
Can't help you much there. It seems to be a bug in the JVM. Again, I'd
try other versions of java.
On Thu, 2009-07-23 at 15:23 -0500, Michael Andric wrote:
> there are a couple here: andrew.bsd.uchicago.edu:/tmp/hs*.log
>
> On Thu, Jul 23, 2009 at 1:20 PM, Mihael Hategan
> wrote:
> There's a JVM dump, namely hs_err_pid32310.log. I'd like to
> see that.
>
> Otherwise it seems related to this:
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6390352
>
> You could try a newer JVM and see if the problem persists.
>
>
> On Thu, 2009-07-23 at 13:13 -0500, Michael Andric wrote:
> > Hi Support, Swift dev, anyone else reading,
> >
> > I keep getting this crash on swift jobs submitted from HNL
> machines
> > (both andrew.bsd.uchicago.edu and gwynn.bsd.uchicago.edu).
> These
> > happen for different workflows, involving different
> processes. I am
> > totally in the dark as to what this error is referring to as
> well as
> > to what may be causing it. This crash has occurred on
> workflows that
> > have just gone 'Active' as well as on workflows that were
> running for
> > hours before crashing.
> >
> >
> > Below is the error message. The log file is too big to
> attach but can
> > be found here:
> > /gpfs/pads/fmri/cnari/swift/projects/andric/peakfit_pilots/PK2/turnpointAnalysis/tpChiSqTests-20090723-1113-na2cuboc.log
> > from one of the HNL machines (e.g., gwynn.bsd.uchicago.edu)
> >
> >
> > Any insight is hugely appreciated - like i said, i don't
> even know
> > what to debug b/c i don't know what the error is referring
> to.
> > Michael
> >
> >
> >
> >
> >
> >
> >
> > Progress: Submitted:11 Active:1
> > Progress: Active:10 Stage out:2
> > #
> > # An unexpected error has been detected by HotSpot Virtual
> Machine:
> > #
> > # SIGBUS (0x7) at pc=0xb75b9a62, pid=32310, tid=2949090208
> > #
> > # Java VM: Java HotSpot(TM) Client VM (1.5.0_06-b05 mixed
> mode,
> > sharing)
> > # Problematic frame:
> > # C [libzip.so+0xfa62]
> > #
> > # An error report file with more information is saved as
> > hs_err_pid32310.log
> > #
> > # If you would like to submit a bug report, please visit:
> > # http://java.sun.com/webapps/bugreport/crash.jsp
> > #
> > /gpfs/pads/fmri/apps/swift/bin/swift: line 100: 32310
> Aborted
> > java -Xmx2048M
> >
> -Djava.endorsed.dirs=/gpfs/pads/fmri/apps/swift/bin/../lib/endorsed
> > -DUID=1309 -DGLOBUS_TCP_PORT_RANGE=50000,51000
> > -DGLOBUS_HOSTNAME=andrew.bsd.uchicago.edu
> -DCOG_INSTALL_PATH=/gpfs/pads/fmri/apps/swift/bin/..
> -Dvds.home=/gpfs/pads/fmri/apps/swift/bin/..
> -Dswift.home=/gpfs/pads/fmri/apps/swift/bin/..
> -Djava.security.egd=file:///dev/urandom -Xmx1024m
> -classpath [long classpath identical to the one quoted in the previous message omitted]
> org.griphyn.vdl.karajan.Loader 'tpChiSqTests.swift' '-sites.file'
> '/gpfs/pads/fmri/cnari_svn/config/coaster_ranger.xml' '-user=andric'
>
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>
From support at ci.uchicago.edu Fri Jul 24 08:50:15 2009
From: support at ci.uchicago.edu (Ti Leggett)
Date: Fri, 24 Jul 2009 08:50:15 -0500
Subject: [Swift-devel] [CI Ticketing System #1372] errors from HNL
machines/swift
In-Reply-To: <1248381203.32020.0.camel@localhost>
References:
<1248373216.28628.1.camel@localhost>
<1248381203.32020.0.camel@localhost>
Message-ID:
Try adding +java-1.6.0_03-sun-r1 above any other lines in your ~/.soft
and run resoft. See if that helps your issues.
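Leggett's suggestion can be rehearsed before touching anything real. The sketch below performs the edit on a throwaway copy of a .soft file rather than the live ~/.soft; the keyword string is the one from the advice above, and it is prepended because the instruction is to put it "above any other lines".

```shell
# Hedged sketch: rehearse the suggested ~/.soft edit on a throwaway copy
# so no real SoftEnv setup is modified.  The keyword comes from the
# advice above.
demo=$(mktemp -d)
printf '@default\n' > "$demo/.soft"          # stand-in for an existing ~/.soft

# Put the Java 1.6 keyword above any other lines, as suggested.
printf '+java-1.6.0_03-sun-r1\n' | cat - "$demo/.soft" > "$demo/.soft.new"
mv "$demo/.soft.new" "$demo/.soft"

head -n 1 "$demo/.soft"    # -> +java-1.6.0_03-sun-r1
# On the real file you would follow this with: resoft
```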
On Thu Jul 23 15:33:34 2009, hategan at mcs.anl.gov wrote:
> Can't help you much there. It seems to be a bug in the JVM. Again, I'd
> try other versions of java.
>
> On Thu, 2009-07-23 at 15:23 -0500, Michael Andric wrote:
> > there are a couple here: andrew.bsd.uchicago.edu:/tmp/hs*.log
> >
> > On Thu, Jul 23, 2009 at 1:20 PM, Mihael Hategan
>
> > wrote:
> > There's a JVM dump, namely hs_err_pid32310.log. I'd like to
> > see that.
> >
> > Otherwise it seems related to this:
> > http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6390352
> >
> > You could try a newer JVM and see if the problem persists.
> >
> >
> > On Thu, 2009-07-23 at 13:13 -0500, Michael Andric wrote:
> > > Hi Support, Swift dev, anyone else reading,
> > >
> > > I keep getting this crash on swift jobs submitted from HNL
> > machines
> > > (both andrew.bsd.uchicago.edu and gwynn.bsd.uchicago.edu).
> > These
> > > happen for different workflows, involving different
> > processes. I am
> > > totally in the dark as to what this error is referring to
> as
> > well as
> > > to what may be causing it. This crash has occurred on
> > workflows that
> > > have just gone 'Active' as well as on workflows that were
> > running for
> > > hours before crashing.
> > >
> > >
> > > Below is the error message. The log file is too big to
> > attach but can
> > > be found here:
> > >
>
/gpfs/pads/fmri/cnari/swift/projects/andric/peakfit_pilots/PK2/turnpointAnalysis/tpChiSqTests-
> 20090723-1113-na2cuboc.log
> > > from one of the HNL machines (e.g.,
> gwynn.bsd.uchicago.edu)
> > >
> > >
> > > Any insight is hugely appreciated - like i said, i don't
> > even know
> > > what to debug b/c i don't know what the error is referring
> > to.
> > > Michael
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Progress: Submitted:11 Active:1
> > > Progress: Active:10 Stage out:2
> > > #
> > > # An unexpected error has been detected by HotSpot Virtual
> > Machine:
> > > #
> > > # SIGBUS (0x7) at pc=0xb75b9a62, pid=32310,
> tid=2949090208
> > > #
> > > # Java VM: Java HotSpot(TM) Client VM (1.5.0_06-b05 mixed
> > mode,
> > > sharing)
> > > # Problematic frame:
> > > # C [libzip.so+0xfa62]
> > > #
> > > # An error report file with more information is saved as
> > > hs_err_pid32310.log
> > > #
> > > # If you would like to submit a bug report, please visit:
> > > # http://java.sun.com/webapps/bugreport/crash.jsp
> > > #
> > > /gpfs/pads/fmri/apps/swift/bin/swift: line 100: 32310
> > Aborted
> > > java -Xmx2048M
> > >
> >
> -Djava.endorsed.dirs=/gpfs/pads/fmri/apps/swift/bin/../lib/endorsed
> > > -DUID=1309 -DGLOBUS_TCP_PORT_RANGE=50000,51000
> > > -DGLOBUS_HOSTNAME=andrew.bsd.uchicago.edu
> > -DCOG_INSTALL_PATH=/gpfs/pads/fmri/apps/swift/bin/..
> > -Dvds.home=/gpfs/pads/fmri/apps/swift/bin/..
> > -Dswift.home=/gpfs/pads/fmri/apps/swift/bin/..
> > -Djava.security.egd=file:///dev/urandom -Xmx1024m
> > -classpath [long classpath identical to the one quoted in the first message omitted]
> > org.griphyn.vdl.karajan.Loader 'tpChiSqTests.swift' '-sites.file'
> > '/gpfs/pads/fmri/cnari_svn/config/coaster_ranger.xml' '-user=andric'
> >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
> >
>
From skenny at uchicago.edu Fri Jul 24 13:25:01 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Fri, 24 Jul 2009 13:25:01 -0500 (CDT)
Subject: [Swift-devel] remote file mapping
Message-ID: <20090724132501.CAO39274@m4500-02.uchicago.edu>
does anyone know the syntax for mapping a file on a remote
machine? i'm told it's possible but couldn't find it in the doc.
thnx
~sk
From hategan at mcs.anl.gov Fri Jul 24 13:47:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 24 Jul 2009 13:47:58 -0500
Subject: [Swift-devel] remote file mapping
In-Reply-To: <20090724132501.CAO39274@m4500-02.uchicago.edu>
References: <20090724132501.CAO39274@m4500-02.uchicago.edu>
Message-ID: <1248461278.20398.0.camel@localhost>
On Fri, 2009-07-24 at 13:25 -0500, skenny at uchicago.edu wrote:
> does anyone know the syntax for mapping a file on a remote
> machine? i'm told it's possible but couldn't find it in the doc.
You should be able to use a URL in any of the mappers. Like
"gsiftp://example.org/file".
From skenny at uchicago.edu Fri Jul 24 15:14:11 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Fri, 24 Jul 2009 15:14:11 -0500 (CDT)
Subject: [Swift-devel] remote file mapping
In-Reply-To: <1248461278.20398.0.camel@localhost>
References: <20090724132501.CAO39274@m4500-02.uchicago.edu>
<1248461278.20398.0.camel@localhost>
Message-ID: <20090724151411.CAO50741@m4500-02.uchicago.edu>
thanks mihael!
so, do you happen to know, would this mean that the gridftp
server on the remote machine is configured to only accept
requests from localhost?
RunID: 20090724-1504-5khkoyd7
Progress:
Execution failed:
java.lang.RuntimeException:
java.lang.RuntimeException: Could not instantiate file resource
Caused by:
Error communicating with the GridFTP server
Caused by:
Authentication failed [Caused by: Operation
unauthorized (Mechanism level: [JGLOBUS-56] Authorization
failed. Expected "/CN=host/localhost.localdomain" target but
received
"/DC=edu/DC=uchicago/DC=ci/OU=hosts/CN=sidgrid.ci.uchicago.edu")]
[skenny at sidgrid urltest]$
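The "Expected "/CN=host/localhost.localdomain"" part of that error means the client resolved the server's name back to a loopback address and therefore demanded a localhost host certificate instead of the real one. A quick check on the machine doing the lookup (a hedged sketch; it only inspects local name resolution):

```shell
# GSI host authorization compares the CN in the presented host certificate
# with the name the client resolves for the server.  If /etc/hosts maps a
# public hostname to a loopback address, the client ends up expecting
# /CN=host/localhost.localdomain.
hostname -f 2>/dev/null || hostname    # what this machine calls itself
grep '^127\.' /etc/hosts || true       # loopback lines; a public name here is suspect
```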
---- Original message ----
>Date: Fri, 24 Jul 2009 13:47:58 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] remote file mapping
>To: skenny at uchicago.edu
>Cc: swift-devel at ci.uchicago.edu
>
>On Fri, 2009-07-24 at 13:25 -0500, skenny at uchicago.edu wrote:
>> does anyone know the syntax for mapping a file on a remote
>> machine? i'm told it's possible but couldn't find it in the
doc.
>
>You should be able to use a URL in any of the mappers. Like
>"gsiftp://example.org/file".
>
>
From hategan at mcs.anl.gov Fri Jul 24 16:10:55 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 24 Jul 2009 16:10:55 -0500
Subject: [Swift-devel] remote file mapping
In-Reply-To: <20090724151411.CAO50741@m4500-02.uchicago.edu>
References: <20090724132501.CAO39274@m4500-02.uchicago.edu>
<1248461278.20398.0.camel@localhost>
<20090724151411.CAO50741@m4500-02.uchicago.edu>
Message-ID: <1248469855.23180.0.camel@localhost>
On Fri, 2009-07-24 at 15:14 -0500, skenny at uchicago.edu wrote:
> thanks mihael!
>
> so, do you happen to know, would this mean that the gridftp
> server on the remote machine is configured to only accept
> requests from localhost?
Are you submitting from sidgrid.ci.uchicago.edu to
sidgrid.ci.uchicago.edu?
What does your script look like?
>
> RunID: 20090724-1504-5khkoyd7
> Progress:
> Execution failed:
> java.lang.RuntimeException:
> java.lang.RuntimeException: Could not instantiate file resource
> Caused by:
> Error communicating with the GridFTP server
> Caused by:
> Authentication failed [Caused by: Operation
> unauthorized (Mechanism level: [JGLOBUS-56] Authorization
> failed. Expected "/CN=host/localhost.localdomain" target but
> received
> "/DC=edu/DC=uchicago/DC=ci/OU=hosts/CN=sidgrid.ci.uchicago.edu")]
> [skenny at sidgrid urltest]$
>
>
> ---- Original message ----
> >Date: Fri, 24 Jul 2009 13:47:58 -0500
> >From: Mihael Hategan
> >Subject: Re: [Swift-devel] remote file mapping
> >To: skenny at uchicago.edu
> >Cc: swift-devel at ci.uchicago.edu
> >
> >On Fri, 2009-07-24 at 13:25 -0500, skenny at uchicago.edu wrote:
> >> does anyone know the syntax for mapping a file on a remote
> >> machine? i'm told it's possible but couldn't find it in the
> doc.
> >
> >You should be able to use a URL in any of the mappers. Like
> >"gsiftp://example.org/file".
> >
> >
From skenny at uchicago.edu Fri Jul 24 16:35:09 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Fri, 24 Jul 2009 16:35:09 -0500 (CDT)
Subject: [Swift-devel] remote file mapping
Message-ID: <20090724163509.CAO58555@m4500-02.uchicago.edu>
i'm on sidgrid, trying to gftp a file from
andrew.bsd.uchicago.edu to ranger.
the script looks like this:
file covMatrix;
Rscript mxScript;
int totalperms[] = [1:100];
float initweight = .5;
foreach perm in totalperms{
mxModel modmin;
modmin = mxModelProcessor(covMatrix, mxScript, perm, initweight, "speech");
but this is failing as well:
[skenny at sidgrid urltest]$ globus-url-copy
gsiftp://andrew.bsd.uchicago.edu/tmp/gestspeech.cov
gsiftp://gridftp.ranger.tacc.teragrid.org:2811/guc.test
GlobusUrlCopy error: UrlCopy third party transfer failed.
[Caused by: Connection refused]
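A third-party globus-url-copy like this asks the source server to open a data connection straight to the destination server, so it can be refused by a firewall even when ordinary client-to-server copies work. One knob worth checking is the data-channel port range; the swift command line earlier in this archive pins it with -DGLOBUS_TCP_PORT_RANGE=50000,51000, and the equivalent for command-line tools is the environment variable below (a sketch; the range itself must match what the firewalls on both ends actually allow):

```shell
# Constrain Globus data channels to a known port range so firewalls can
# be opened for it.  50000,51000 mirrors the -DGLOBUS_TCP_PORT_RANGE
# setting seen elsewhere in this archive; substitute your own range.
export GLOBUS_TCP_PORT_RANGE=50000,51000
echo "$GLOBUS_TCP_PORT_RANGE"
```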
---- Original message ----
>Date: Fri, 24 Jul 2009 16:10:55 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] remote file mapping
>To: skenny at uchicago.edu
>Cc: swift-devel at ci.uchicago.edu
>
>On Fri, 2009-07-24 at 15:14 -0500, skenny at uchicago.edu wrote:
>> thanks mihael!
>>
>> so, do you happen to know, would this mean that the gridftp
>> server on the remote machine is configured to only accept
>> requests from localhost?
>
>Are you submitting from sidgrid.ci.uchicago.edu to
>sidgrid.ci.uchicago.edu?
>
>What does your script look like?
>
>>
>> RunID: 20090724-1504-5khkoyd7
>> Progress:
>> Execution failed:
>> java.lang.RuntimeException:
>> java.lang.RuntimeException: Could not instantiate file resource
>> Caused by:
>> Error communicating with the GridFTP server
>> Caused by:
>> Authentication failed [Caused by: Operation
>> unauthorized (Mechanism level: [JGLOBUS-56] Authorization
>> failed. Expected "/CN=host/localhost.localdomain" target but
>> received
>>
"/DC=edu/DC=uchicago/DC=ci/OU=hosts/CN=sidgrid.ci.uchicago.edu")]
>> [skenny at sidgrid urltest]$
>>
>>
>> ---- Original message ----
>> >Date: Fri, 24 Jul 2009 13:47:58 -0500
>> >From: Mihael Hategan
>> >Subject: Re: [Swift-devel] remote file mapping
>> >To: skenny at uchicago.edu
>> >Cc: swift-devel at ci.uchicago.edu
>> >
>> >On Fri, 2009-07-24 at 13:25 -0500, skenny at uchicago.edu wrote:
>> >> does anyone know the syntax for mapping a file on a remote
>> >> machine? i'm told it's possible but couldn't find it in the
>> doc.
>> >
>> >You should be able to use a URL in any of the mappers. Like
>> >"gsiftp://example.org/file".
>> >
>> >
>
From hategan at mcs.anl.gov Fri Jul 24 16:38:18 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 24 Jul 2009 16:38:18 -0500
Subject: [Swift-devel] remote file mapping
In-Reply-To: <20090724163509.CAO58555@m4500-02.uchicago.edu>
References: <20090724163509.CAO58555@m4500-02.uchicago.edu>
Message-ID: <1248471498.23879.1.camel@localhost>
Wait, wait. Slow down. One problem at a time.
On Fri, 2009-07-24 at 16:35 -0500, skenny at uchicago.edu wrote:
> i'm on sidgrid, trying to gftp a file from
> andrew.bsd.uchicago.edu to ranger.
>
> the script looks like this:
>
> file
> covMatrix;
Why "gsiftp:///" instead of "gsiftp://"?
> Rscript
> mxScript;
>
> int totalperms[] = [1:100];
> float initweight = .5;
> foreach perm in totalperms{
> mxModel modmin file=@strcat("gsiftp:///andrew.ci.uchicago.edu/home/skenny/swift_runs/urltest/results/speech_",perm,".rdata")>;
> modmin = mxModelProcessor(covMatrix, mxScript, perm,
> initweight, "speech");
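[Editorial note: Mihael's question about the triple slash matters because a generic URL parser treats the text between `//` and the next `/` as the host; with `gsiftp:///` the host is empty and `andrew.ci.uchicago.edu` becomes part of the path. A quick illustration with Python's urllib (illustrative only, not anything Swift itself runs):]

```python
from urllib.parse import urlparse

# Three slashes: the authority part is empty, so the host name
# ends up inside the path -- the transfer would not target andrew at all.
bad = urlparse("gsiftp:///andrew.ci.uchicago.edu/home/skenny/file")
good = urlparse("gsiftp://andrew.ci.uchicago.edu/home/skenny/file")

print(bad.netloc)   # '' (empty host)
print(bad.path)     # '/andrew.ci.uchicago.edu/home/skenny/file'
print(good.netloc)  # 'andrew.ci.uchicago.edu'
```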
>
> but this is failing as well:
>
> [skenny at sidgrid urltest]$ globus-url-copy
> gsiftp://andrew.bsd.uchicago.edu/tmp/gestspeech.cov
> gsiftp://gridftp.ranger.tacc.teragrid.org:2811/guc.test
> GlobusUrlCopy error: UrlCopy third party transfer failed.
> [Caused by: Connection refused]
>
>
>
> ---- Original message ----
> >Date: Fri, 24 Jul 2009 16:10:55 -0500
> >From: Mihael Hategan
> >Subject: Re: [Swift-devel] remote file mapping
> >To: skenny at uchicago.edu
> >Cc: swift-devel at ci.uchicago.edu
> >
> >On Fri, 2009-07-24 at 15:14 -0500, skenny at uchicago.edu wrote:
> >> thanks mihael!
> >>
> >> so, do you happen to know, would this mean that the gridftp
> >> server on the remote machine is configured to only accept
> >> requests from localhost?
> >
> >Are you submitting from sidgrid.ci.uchicago.edu to
> >sidgrid.ci.uchicago.edu?
> >
> >What does your script look like?
> >
> >>
> >> RunID: 20090724-1504-5khkoyd7
> >> Progress:
> >> Execution failed:
> >> java.lang.RuntimeException:
> >> java.lang.RuntimeException: Could not instantiate file resource
> >> Caused by:
> >> Error communicating with the GridFTP server
> >> Caused by:
> >> Authentication failed [Caused by: Operation
> >> unauthorized (Mechanism level: [JGLOBUS-56] Authorization
> >> failed. Expected "/CN=host/localhost.localdomain" target but
> >> received
> >>
> "/DC=edu/DC=uchicago/DC=ci/OU=hosts/CN=sidgrid.ci.uchicago.edu")]
> >> [skenny at sidgrid urltest]$
> >>
> >>
> >> ---- Original message ----
> >> >Date: Fri, 24 Jul 2009 13:47:58 -0500
> >> >From: Mihael Hategan
> >> >Subject: Re: [Swift-devel] remote file mapping
> >> >To: skenny at uchicago.edu
> >> >Cc: swift-devel at ci.uchicago.edu
> >> >
> >> >On Fri, 2009-07-24 at 13:25 -0500, skenny at uchicago.edu wrote:
> >> >> does anyone know the syntax for mapping a file on a remote
> >> >> machine? i'm told it's possible but couldn't find it in the
> >> doc.
> >> >
> >> >You should be able to use a URL in any of the mappers. Like
> >> >"gsiftp://example.org/file".
> >> >
> >> >
> >
From wilde at mcs.anl.gov Sun Jul 26 18:09:59 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 26 Jul 2009 18:09:59 -0500
Subject: [Swift-devel] Re: [Swift-user] XDTM
In-Reply-To:
References:
Message-ID: <4A6CE247.4010105@mcs.anl.gov>
Jamal,
As Swift evolved from its early prototypes to a more mature system, the
notion of XDTM evolved to one of mapping between filesystem-based
structures and Swift in-memory data structures (ie, scalars, arrays, and
structures, which can be nested and typed).
This is best seen by looking at the "external" mapper, which allows a
user to map a dataset using any external program (typically a script)
that returns the members of the dataset as a two-column list: the Swift
variable reference, and the external file or URI.
See the user guide section on the external mapper:
http://www.ci.uchicago.edu/swift/guides/userguide.php#mapper.ext_mapper
(but the example in the user guide doesn't show the power of mapping to
nested structures).
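[Editorial note: as a minimal sketch of the two-column convention Mike describes (the file names and the bracketed member-path syntax here are illustrative, not taken from the user guide), an external mapper is just a program that prints one line per dataset member: the Swift variable reference, then the file it maps to.]

```shell
#!/bin/sh
# Illustrative external-mapper script: print two columns per member --
# the Swift variable reference (array members [0], [1], ...) and the file.
map_members() {
    i=0
    for f in "$@"; do
        printf '[%d] %s\n' "$i" "$f"
        i=$((i + 1))
    done
}

# Example invocation with made-up file names:
map_members speech_1.rdata speech_2.rdata
# prints:
# [0] speech_1.rdata
# [1] speech_2.rdata
```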
In other words, it still has the flavor of XDTM, but without any XML
being visible to the user. It meets the same need but is easier to use
and explain.
- Mike
On 7/26/09 2:50 PM, J A wrote:
> Hi All:
>
> Can any one direct me to a source with more examples/explanation on
> how XDTM is working/implemented?
>
> Thanks,
> Jamal
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
From wilde at mcs.anl.gov Mon Jul 27 23:21:52 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 27 Jul 2009 23:21:52 -0500
Subject: [Swift-devel] Swift trunk seems to be broken
Message-ID: <4A6E7CE0.10901@mcs.anl.gov>
This script:
com$ cat >t3.swift
type d {
int x;
}
com$
Gives:
com$ swift t3.swift
Swift svn swift-r3019 cog-r2445
RunID: 20090727-2313-2zka71if
Execution failed:
org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not
convert value to number: unbounded
Caused by:
For input string: "unbounded"
com$
com$ java -version
java version "1.5.0_06"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
Java HotSpot(TM) Server VM (build 1.5.0_06-b05, mixed mode)
com$
Is anyone else seeing this problem?
This fails for me on both communicado and on the BG/P.
On the BG/P I tried with both Java 2.4 and Java 6; both failed the same way.
- Mike
From wilde at mcs.anl.gov Mon Jul 27 23:51:56 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 27 Jul 2009 23:51:56 -0500
Subject: [Swift-devel] Swift trunk seems to be broken
In-Reply-To: <4A6E7CE0.10901@mcs.anl.gov>
References: <4A6E7CE0.10901@mcs.anl.gov>
Message-ID: <4A6E83EC.4070703@mcs.anl.gov>
I think it's cog rev 2440 that's causing the problem.
2440 fails:
com$ swift t3.swift
Swift svn swift-r3021 cog-r2440
RunID: 20090727-2333-dpf7v3ze
Execution failed:
org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not
convert value to number: unbounded
Caused by:
For input string: "unbounded"
com$ cd -
/home/wilde/swift/src/cog/modules/swift
com$
com$
2439 works:
com$ swift t3.swift
Swift svn swift-r3021 cog-r2439
RunID: 20090727-2337-g19sgr5f
com$
2440 is:
com$ svn diff -r 2439:2440
Index: modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java
===================================================================
--- modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java  (revision 2439)
+++ modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java  (revision 2440)
@@ -84,7 +84,13 @@

     public boolean sys_equals(VariableStack stack) throws ExecutionException {
         Object[] args = getArguments(ARGS_2VALUES, stack);
-        return args[0].equals(args[1]);
+        if (args[0] instanceof Number) {
+            Number n2 = TypeUtil.toNumber(args[1]);
+            return ((Number) args[0]).doubleValue() == n2.doubleValue();
+        }
+        else {
+            return args[0].equals(args[1]);
+        }
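[Editorial note: a standalone sketch of why the r2440 change bites. The names here are mine, not the actual cog code: the point is that once the left-hand side is a Number, the right-hand side is forced through a string-to-number parse, which throws for a non-numeric string like "unbounded".]

```java
public class EqualsSketch {
    // Mimics the r2440 logic: a numeric left-hand side coerces the
    // right-hand side to a number before comparing.
    static boolean sysEquals(Object a, Object b) {
        if (a instanceof Number) {
            // This is the step that throws for b = "unbounded".
            double n2 = Double.parseDouble(b.toString());
            return ((Number) a).doubleValue() == n2;
        }
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(sysEquals(1, "1"));   // true: the new coercion at work
        try {
            sysEquals(1, "unbounded");
        } catch (NumberFormatException e) {
            // The failure mode Mike hit: "For input string: \"unbounded\""
            System.out.println("NumberFormatException");
        }
    }
}
```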
Exception in log (example) is below.
- Mike
2009-07-27 21:36:28,559-0500 INFO unknown Swift svn swift-r3019 (swift modified locally) cog-r2445
2009-07-27 21:36:28,561-0500 INFO unknown RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090727-2136-ibeu4gif
2009-07-27 21:36:28,719-0500 DEBUG VDL2ExecutionContext org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to number: unbounded
org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to number: unbounded
Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to number: unbounded
    at org.globus.cog.karajan.util.TypeUtil.toNumber(TypeUtil.java:61)
    at org.globus.cog.karajan.workflow.nodes.functions.Misc.sys_equals(Misc.java:88)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:85)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:58)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:60)
    at java.lang.reflect.Method.invoke(Method.java:391)
    at org.globus.cog.karajan.workflow.nodes.functions.FunctionsCollection.function(FunctionsCollection.java:78)
    at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45)
    at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java(Compiled Code))
    at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java(Compiled Code))
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled Code))
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled Code))
    at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Compiled Code))
    at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java(Compiled Code))
    at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java(Compiled Code))
    at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:37)
    at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled Code))
    at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled Code))
    at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined Compiled Code))
    at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled Code))
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled Code))
    at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled Code))
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled Code))
    at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined Compiled Code))
    at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled Code))
Caused by: java.lang.NumberFormatException: For input string: "unbounded"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java(Compiled Code))
    at java.lang.FloatingDecimal.readJavaFormatString(FloatingDecimal.java(Compiled Code))
    at java.lang.Double.valueOf(Double.java:227)
    at org.globus.cog.karajan.util.TypeUtil.toNumber(TypeUtil.java:51)
    ... 25 more
On 7/27/09 11:21 PM, Michael Wilde wrote:
> This script:
>
> com$ cat >t3.swift
> type d {
> int x;
> }
> com$
>
> Gives:
>
> com$ swift t3.swift
> Swift svn swift-r3019 cog-r2445
>
> RunID: 20090727-2313-2zka71if
> Execution failed:
> org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not
> convert value to number: unbounded
> Caused by:
> For input string: "unbounded"
> com$
>
>
> com$ java -version
> java version "1.5.0_06"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
> Java HotSpot(TM) Server VM (build 1.5.0_06-b05, mixed mode)
> com$
>
>
> Is anyone else seeing this problem?
>
> This fails for me on both communicado and on the BG/P.
> On the BG/P I tried with both Java 2.4 and Java 6; both failed the same
> way.
>
> - Mike
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Tue Jul 28 00:10:59 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 28 Jul 2009 00:10:59 -0500
Subject: [Swift-devel] Swift trunk seems to be broken
In-Reply-To: <4A6E83EC.4070703@mcs.anl.gov>
References: <4A6E7CE0.10901@mcs.anl.gov> <4A6E83EC.4070703@mcs.anl.gov>
Message-ID: <1248757859.24917.0.camel@localhost>
Fixed in cog r2446.
On Mon, 2009-07-27 at 23:51 -0500, Michael Wilde wrote:
> I think its cog rev 2440 thats causing the problem.
>
> 2440 fails:
>
> com$ swift t3.swift
> Swift svn swift-r3021 cog-r2440
>
> RunID: 20090727-2333-dpf7v3ze
> Execution failed:
> org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not
> convert value to number: unbounded
> Caused by:
> For input string: "unbounded"
> com$ cd -
> /home/wilde/swift/src/cog/modules/swift
> com$
> com$
>
>
> 2339 works:
>
> com$ swift t3.swift
> Swift svn swift-r3021 cog-r2439
>
> RunID: 20090727-2337-g19sgr5f
> com$
>
>
> 2440 is:
>
> com$ svn diff -r 2439:2440
> Index:
> modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java
> ===================================================================
> ---
> modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java
> (revision 2439)
> +++
> modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java
> (revision 2440)
> @@ -84,7 +84,13 @@
>
> public boolean sys_equals(VariableStack stack) throws
> ExecutionException {
> Object[] args = getArguments(ARGS_2VALUES, stack);
> - return args[0].equals(args[1]);
> + if (args[0] instanceof Number) {
> + Number n2 = TypeUtil.toNumber(args[1]);
> + return ((Number) args[0]).doubleValue() == n2.doubleValue();
> + }
> + else {
> + return args[0].equals(args[1]);
> + }
>
> Exception in log (example) is below.
>
> - Mike
>
> 2009-07-27 21:36:28,559-0500 INFO unknown Swift svn swift-r3019 (swift
> modified locally) cog-r2445
>
> 2009-07-27 21:36:28,561-0500 INFO unknown RUNID
> id=tag:benc at ci.uchicago.edu,2007:swift:run:20090727-2136-ibeu4gif
> 2009-07-27 21:36:28,719-0500 DEBUG VDL2ExecutionContext
> org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not
> convert value to number: unbounded
> org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not
> convert value to number: unbounded
> Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException:
> Could not convert value to number: unbounded
> at org.globus.cog.karajan.util.TypeUtil.toNumber(TypeUtil.java:61)
> at
> org.globus.cog.karajan.workflow.nodes.functions.Misc.sys_equals(Misc.java:88)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:85)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:58)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:60)
> at java.lang.reflect.Method.invoke(Method.java:391)
> at
> org.globus.cog.karajan.workflow.nodes.functions.FunctionsCollection.function(FunctionsCollection.java:78)
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45)
> at
> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:37)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined
> Compiled Code))
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled
> Code))
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined
> Compiled Code))
> at
> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled
> Code))
> Caused by: java.lang.NumberFormatException: For input string: "unbounded"
> at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java(Compiled
> Code))
> at
> java.lang.FloatingDecimal.readJavaFormatString(FloatingDecimal.java(Compiled
> Code))
> at java.lang.Double.valueOf(Double.java:227)
> at org.globus.cog.karajan.util.TypeUtil.toNumber(TypeUtil.java:51)
> ... 25 more
>
>
>
>
>
> On 7/27/09 11:21 PM, Michael Wilde wrote:
> > This script:
> >
> > com$ cat >t3.swift
> > type d {
> > int x;
> > }
> > com$
> >
> > Gives:
> >
> > com$ swift t3.swift
> > Swift svn swift-r3019 cog-r2445
> >
> > RunID: 20090727-2313-2zka71if
> > Execution failed:
> > org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not
> > convert value to number: unbounded
> > Caused by:
> > For input string: "unbounded"
> > com$
> >
> >
> > com$ java -version
> > java version "1.5.0_06"
> > Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
> > Java HotSpot(TM) Server VM (build 1.5.0_06-b05, mixed mode)
> > com$
> >
> >
> > Is anyone else seeing this problem?
> >
> > This fails for me on both communicado and on the BG/P.
> > On the BG/P I tried with both Java 2.4 and Java 6; both failed the same
> > way.
> >
> > - Mike
> >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Tue Jul 28 00:38:20 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Jul 2009 00:38:20 -0500
Subject: [Swift-devel] Swift trunk seems to be broken
In-Reply-To: <1248757859.24917.0.camel@localhost>
References: <4A6E7CE0.10901@mcs.anl.gov> <4A6E83EC.4070703@mcs.anl.gov>
<1248757859.24917.0.camel@localhost>
Message-ID: <4A6E8ECC.1080102@mcs.anl.gov>
That works - thanks!
Glen, please try your latest OOPS run again now.
- Mike
On 7/28/09 12:10 AM, Mihael Hategan wrote:
> Fixed in cog r2446.
>
> On Mon, 2009-07-27 at 23:51 -0500, Michael Wilde wrote:
>> I think its cog rev 2440 thats causing the problem.
>>
>> 2440 fails:
>>
>> com$ swift t3.swift
>> Swift svn swift-r3021 cog-r2440
>>
>> RunID: 20090727-2333-dpf7v3ze
>> Execution failed:
>> org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not
>> convert value to number: unbounded
>> Caused by:
>> For input string: "unbounded"
>> com$ cd -
>> /home/wilde/swift/src/cog/modules/swift
>> com$
>> com$
>>
>>
>> 2339 works:
>>
>> com$ swift t3.swift
>> Swift svn swift-r3021 cog-r2439
>>
>> RunID: 20090727-2337-g19sgr5f
>> com$
>>
>>
>> 2440 is:
>>
>> com$ svn diff -r 2439:2440
>> Index:
>> modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java
>> ===================================================================
>> ---
>> modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java
>> (revision 2439)
>> +++
>> modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java
>> (revision 2440)
>> @@ -84,7 +84,13 @@
>>
>> public boolean sys_equals(VariableStack stack) throws
>> ExecutionException {
>> Object[] args = getArguments(ARGS_2VALUES, stack);
>> - return args[0].equals(args[1]);
>> + if (args[0] instanceof Number) {
>> + Number n2 = TypeUtil.toNumber(args[1]);
>> + return ((Number) args[0]).doubleValue() == n2.doubleValue();
>> + }
>> + else {
>> + return args[0].equals(args[1]);
>> + }
>>
>> Exception in log (example) is below.
>>
>> - Mike
>>
>> 2009-07-27 21:36:28,559-0500 INFO unknown Swift svn swift-r3019 (swift
>> modified locally) cog-r2445
>>
>> 2009-07-27 21:36:28,561-0500 INFO unknown RUNID
>> id=tag:benc at ci.uchicago.edu,2007:swift:run:20090727-2136-ibeu4gif
>> 2009-07-27 21:36:28,719-0500 DEBUG VDL2ExecutionContext
>> org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not
>> convert value to number: unbounded
>> org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not
>> convert value to number: unbounded
>> Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException:
>> Could not convert value to number: unbounded
>> at org.globus.cog.karajan.util.TypeUtil.toNumber(TypeUtil.java:61)
>> at
>> org.globus.cog.karajan.workflow.nodes.functions.Misc.sys_equals(Misc.java:88)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:85)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:58)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:60)
>> at java.lang.reflect.Method.invoke(Method.java:391)
>> at
>> org.globus.cog.karajan.workflow.nodes.functions.FunctionsCollection.function(FunctionsCollection.java:78)
>> at
>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45)
>> at
>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:37)
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined
>> Compiled Code))
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled
>> Code))
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined
>> Compiled Code))
>> at
>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled
>> Code))
>> Caused by: java.lang.NumberFormatException: For input string: "unbounded"
>> at
>> java.lang.NumberFormatException.forInputString(NumberFormatException.java(Compiled
>> Code))
>> at
>> java.lang.FloatingDecimal.readJavaFormatString(FloatingDecimal.java(Compiled
>> Code))
>> at java.lang.Double.valueOf(Double.java:227)
>> at org.globus.cog.karajan.util.TypeUtil.toNumber(TypeUtil.java:51)
>> ... 25 more
>>
>>
>>
>>
>>
>> On 7/27/09 11:21 PM, Michael Wilde wrote:
>>> This script:
>>>
>>> com$ cat >t3.swift
>>> type d {
>>> int x;
>>> }
>>> com$
>>>
>>> Gives:
>>>
>>> com$ swift t3.swift
>>> Swift svn swift-r3019 cog-r2445
>>>
>>> RunID: 20090727-2313-2zka71if
>>> Execution failed:
>>> org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not
>>> convert value to number: unbounded
>>> Caused by:
>>> For input string: "unbounded"
>>> com$
>>>
>>>
>>> com$ java -version
>>> java version "1.5.0_06"
>>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
>>> Java HotSpot(TM) Server VM (build 1.5.0_06-b05, mixed mode)
>>> com$
>>>
>>>
>>> Is anyone else seeing this problem?
>>>
>>> This fails for me on both communicado and on the BG/P.
>>> On the BG/P I tried with both Java 2.4 and Java 6; both failed the same
>>> way.
>>>
>>> - Mike
>>>
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From smartin at mcs.anl.gov Tue Jul 28 09:26:30 2009
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Tue, 28 Jul 2009 09:26:30 -0500
Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2
In-Reply-To: <4A663FD8.3050909@mcs.anl.gov>
References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov>
<4A663FD8.3050909@mcs.anl.gov>
Message-ID:
Hi Mike,
Just following up on this. Will there be some swift use of GRAM5 on
queen bee this week?
-Stu
On Jul 21, 2009, at Jul 21, 5:23 PM, Michael Wilde wrote:
> Yes, there are a few we can run on QueenBee.
>
> Can try to test next week.
>
> Allan, we can test SEE/AMPL, OOPS, and PTMap there.
>
> - Mike
>
>
> On 7/21/09 10:58 AM, Stuart Martin wrote:
>> Are there any swift apps that can use queen bee? There is a GRAM5
>> service setup there for testing.
>> -Stu
>> Begin forwarded message:
>>> From: Stuart Martin
>>> Date: July 21, 2009 10:56:04 AM CDT
>>> To: gateways at teragrid.org
>>> Cc: Stuart Martin , Lukasz Lacinski >> >
>>> Subject: Fwd: [gram-user] GRAM5 Alpha2
>>>
>>> Hi Gateways,
>>>
>>> Any gateways that use (or can use) Queen Bee, it would be great if
>>> you could target this new GRAM5 service that Lukasz deployed. I
>>> heard from Lukasz that Jim has submitted a gateway user (SAML) job
>>> and that went through fine and populate the gram audit DB
>>> correctly. Thanks Jim! It would be nice to have some gateway
>>> push the service to test scalability.
>>>
>>> Let us know if you plan to do this.
>>>
>>> Thanks,
>>> Stu
>>>
>>> Begin forwarded message:
>>>
>>>> From: Lukasz Lacinski
>>>> Date: July 21, 2009 1:18:05 AM CDT
>>>> To: gram-user at lists.globus.org
>>>> Subject: [gram-user] GRAM5 Alpha2
>>>>
>>>> I've installed GRAM5 Alpha2 on Queen Bee.
>>>>
>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs
>>>>
>>>> -seg-module pbs works fine.
>>>> GRAM audit with PostgreSQL works fine.
>>>>
>>>> Can someone submit jobs as a gateway user? I'd like to check if
>>>> the gateway_user field is written to our audit database.
>>>>
>>>> Thanks,
>>>> Lukasz
>>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Tue Jul 28 10:56:37 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Jul 2009 10:56:37 -0500
Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2
In-Reply-To:
References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov>
<4A663FD8.3050909@mcs.anl.gov>
Message-ID: <4A6F1FB5.2080207@mcs.anl.gov>
Allan Espinosa will try to test AMPL workflows for the SEE project there
this week.
I may try a few others time permitting, but likely not this week.
Questions, Stu:
- do you want testing through Condor-G with the grid_monitor as well as
native?
- for native testing of GRAM5 (ie through the plain pre-WS GRAM
interface) are there any guidelines for how many jobs we can safely
submit at once, or should we not worry about limits? (ie sending a few
thousand jobs is OK?)
Allan: I just remembered that since Queenbee has 8-core hosts like Abe,
coasters is the only reasonable approach for large-scale testing. But
testing just a few AMPL jobs through plain GRAM5 seems a reasonable step
to do first.
I realize that coaster testing, also, won't give good CPU utilization
until the current "low demand" problem is solved.
- Mike
On 7/28/09 9:26 AM, Stuart Martin wrote:
> Hi Mike,
>
> Just following up on this. Will there be some swift use of GRAM5 on
> queen bee this week?
>
> -Stu
>
> On Jul 21, 2009, at Jul 21, 5:23 PM, Michael Wilde wrote:
>
>> Yes, there are a few we can run on QueenBee.
>>
>> Can try to test next week.
>>
>> Allan, we can test SEE/AMPL, OOPS, and PTMap there.
>>
>> - Mike
>>
>>
>> On 7/21/09 10:58 AM, Stuart Martin wrote:
>>> Are there any swift apps that can use queen bee? There is a GRAM5
>>> service setup there for testing.
>>> -Stu
>>> Begin forwarded message:
>>>> From: Stuart Martin
>>>> Date: July 21, 2009 10:56:04 AM CDT
>>>> To: gateways at teragrid.org
>>>> Cc: Stuart Martin , Lukasz Lacinski
>>>>
>>>> Subject: Fwd: [gram-user] GRAM5 Alpha2
>>>>
>>>> Hi Gateways,
>>>>
>>>> Any gateways that use (or can use) Queen Bee, it would be great if
>>>> you could target this new GRAM5 service that Lukasz deployed. I
>>>> heard from Lukasz that Jim has submitted a gateway user (SAML) job
>>>> and that went through fine and populate the gram audit DB
>>>> correctly. Thanks Jim! It would be nice to have some gateway push
>>>> the service to test scalability.
>>>>
>>>> Let us know if you plan to do this.
>>>>
>>>> Thanks,
>>>> Stu
>>>>
>>>> Begin forwarded message:
>>>>
>>>>> From: Lukasz Lacinski
>>>>> Date: July 21, 2009 1:18:05 AM CDT
>>>>> To: gram-user at lists.globus.org
>>>>> Subject: [gram-user] GRAM5 Alpha2
>>>>>
>>>>> I've installed GRAM5 Alpha2 on Queen Bee.
>>>>>
>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs
>>>>>
>>>>> -seg-module pbs works fine.
>>>>> GRAM audit with PostgreSQL works fine.
>>>>>
>>>>> Can someone submit jobs as a gateway user? I'd like to check if the
>>>>> gateway_user field is written to our audit database.
>>>>>
>>>>> Thanks,
>>>>> Lukasz
>>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From foster at anl.gov Tue Jul 28 11:09:01 2009
From: foster at anl.gov (Ian Foster)
Date: Tue, 28 Jul 2009 11:09:01 -0500
Subject: [Swift-devel] Functionality request: best effort execution
In-Reply-To:
References:
Message-ID:
I agree that this sort of thing would be of great value for some
applications.
Note that this would make provenance recording more interesting and
important! (As you need to record what happened, not just the input
arguments.)
Ian.
On Jul 14, 2009, at 2:09 AM, Ben Clifford wrote:
>
> One way of putting in ambiguity here is something like the AMB(iguous)
> operator, which looks very similar to Karajan's race behaviour.
>
> a AMB b evaluates to either a or b but its not defined which and
> so the
> runtime can pick which.
>
> That has no particular preference for a result, though in Tibi's use
> case
> one of the results is probably preferred.
>
> You could change the semantics so that it returns a unless a fails in
> which case it evaluates and returns b, unless b fails in which case
> the
> expression fails to evaluate.
>
> Both of the above descriptions can be extended to more than two
> operands
> in a natural way.
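[Editorial note: the fallback variant Ben describes can be sketched in a few lines. The name `amb` and the modeling of "fails" as raising an exception are mine; this is illustrative only, not Swift or Karajan code.]

```python
# amb(a, b, ...) evaluates its operands (passed as thunks) left to right
# and returns the first one that does not fail; if all fail, the whole
# expression fails (the last error is re-raised).
def amb(*thunks):
    last_exc = None
    for t in thunks:
        try:
            return t()
        except Exception as e:  # "fails" modeled as raising
            last_exc = e
    raise last_exc

amb(lambda: 1 // 0, lambda: 42)  # first operand fails, falls back to 42
```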
>
> --
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From smartin at mcs.anl.gov Tue Jul 28 11:17:23 2009
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Tue, 28 Jul 2009 11:17:23 -0500
Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2
In-Reply-To: <4A6F1FB5.2080207@mcs.anl.gov>
References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov>
<4A663FD8.3050909@mcs.anl.gov>
<4A6F1FB5.2080207@mcs.anl.gov>
Message-ID: <6701C4A1-FC0D-4DA4-9972-2F9CEC8EF11D@mcs.anl.gov>
On Jul 28, 2009, at Jul 28, 10:56 AM, Michael Wilde wrote:
> Allan Espinosa will try to test AMPL workflows for the SEE project
> there this week.
>
> I may try a few others time permitting, but likely not this week.
>
> Questions, Stu:
> - do you want testing through Condor-G with the grid_monitor as well
> as native?
I'd say use GRAM5 however works best for you/your users. We've done some
condor-g testing with and without the grid-monitor. We tested with it
just for backward compatibility, but running without it is recommended:
the grid-monitor is no longer needed with GRAM5.
So, if you have users that use condor-g, submit GRAM5 jobs with that,
but turn off the grid-monitor.
http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_7:_gram5-condor-g
But if it is "better" to submit them natively, through cog API I
assume(?), then do that.
> - for native testing of GRAM5 (ie through the plain pre-WS GRAM
> interface) are there any guidelines for how many jobs we can safely
> submit at once, or should we not worry about limits? (ie sending a
> few thousand jobs is OK?)
Don't worry about it and submit away. We need to know the limits/
breaking points.
But, to show what we've done in our testing, here are the results from
our 5 client tests (each running in a separate VM) hitting the same
GRAM5 service.
http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_4:_5-client-
seg_2
http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_5:_5-client-
seg-diffusers_2
They submitted 5000 jobs over a 1 hour window to the same GRAM5
service. The load on the head node never went above 4 on the first
and 7 on the second.
>
> Allan: I just remembered that since Queenbee has 8-core hosts like
> Abe, coasters is the only reasonable approach for large-scale
> testing. But testing just a few AMPL jobs through plain GRAM5 seems
> a reasonable step to do first.
>
> I realize that coaster testing, also, won't give good CPU utilization
> until the current "low demand" problem is solved.
>
> - Mike
>
>
> On 7/28/09 9:26 AM, Stuart Martin wrote:
>> Hi Mike,
>> Just following up on this. Will there be some swift use of GRAM5
>> on queen bee this week?
>> -Stu
>> On Jul 21, 2009, at Jul 21, 5:23 PM, Michael Wilde wrote:
>>> Yes, there are a few we can run on QueenBee.
>>>
>>> Can try to test next week.
>>>
>>> Allan, we can test SEE/AMPL, OOPS, and PTMap there.
>>>
>>> - Mike
>>>
>>>
>>> On 7/21/09 10:58 AM, Stuart Martin wrote:
>>>> Are there any swift apps that can use queen bee? There is a
>>>> GRAM5 service setup there for testing.
>>>> -Stu
>>>> Begin forwarded message:
>>>>> From: Stuart Martin
>>>>> Date: July 21, 2009 10:56:04 AM CDT
>>>>> To: gateways at teragrid.org
>>>>> Cc: Stuart Martin , Lukasz Lacinski >>>> >
>>>>> Subject: Fwd: [gram-user] GRAM5 Alpha2
>>>>>
>>>>> Hi Gateways,
>>>>>
>>>>> Any gateways that use (or can use) Queen Bee, it would be great
>>>>> if you could target this new GRAM5 service that Lukasz
>>>>> deployed. I heard from Lukasz that Jim has submitted a gateway
>>>>> user (SAML) job and that went through fine and populated the gram
>>>>> audit DB correctly. Thanks Jim! It would be nice to have some
>>>>> gateway push the service to test scalability.
>>>>>
>>>>> Let us know if you plan to do this.
>>>>>
>>>>> Thanks,
>>>>> Stu
>>>>>
>>>>> Begin forwarded message:
>>>>>
>>>>>> From: Lukasz Lacinski
>>>>>> Date: July 21, 2009 1:18:05 AM CDT
>>>>>> To: gram-user at lists.globus.org
>>>>>> Subject: [gram-user] GRAM5 Alpha2
>>>>>>
>>>>>> I've installed GRAM5 Alpha2 on Queen Bee.
>>>>>>
>>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
>>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs
>>>>>>
>>>>>> -seg-module pbs works fine.
>>>>>> GRAM audit with PostgreSQL works fine.
>>>>>>
>>>>>> Can someone submit jobs as a gateway user? I'd like to check if
>>>>>> the gateway_user field is written to our audit database.
>>>>>>
>>>>>> Thanks,
>>>>>> Lukasz
>>>>>
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From aespinosa at cs.uchicago.edu Tue Jul 28 16:06:48 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 28 Jul 2009 16:06:48 -0500
Subject: [Swift-devel] Re: coaster workers not receiving enough jobs
In-Reply-To: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com>
References: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com>
Message-ID: <50b07b4b0907281406v238f95a0p52c58a65f98ef0af@mail.gmail.com>
Now I tried changing 5 mins to 5 secs. Based on the worker ids in the
pull() calls, it looks like all the jobs were successfully assigned to
all workers throughout the blocks.
[aespinosa at communicado coasters]$ grep pull coasters.log | grep -v
Later | awk '{print $5}' | sort -u | cat -n | tail
65 0728-590322-000001:58
66 0728-590322-000001:59
67 0728-590322-000001:6
68 0728-590322-000001:60
69 0728-590322-000001:61
70 0728-590322-000001:62
71 0728-590322-000001:63
72 0728-590322-000001:7
73 0728-590322-000001:8
74 0728-590322-000001:9
My guess is that with long jobs (5 mins), pull() times out while waiting
and jobs only get assigned much later on. But this doesn't happen,
because of some timeout mechanisms too (I think).
2009/7/23 Allan Espinosa :
> I tried 0660-many.swift with 200 5min sleep jobs using local:local
> mode (since queue on ranger and teraport takes a while to finish).
> The session spawned 192 workers. Swift reports at most 36 active
> processes at a time (which it finished successfully). After that
> workers reach idle time exceptions. Logs and stuff are in
> ~aespinosa/workflows/coaster_debug/run1/
>
> sites.xml (markup stripped by the list archiver; only the values survive):
>
>   /home/aespinosa/workflows/coaster_debug/workdir
>   10000
>   1.98
>   1
>   00:05:00
>   3600
>
>
>
> swift session:
> Swift svn swift-r3011 cog-r2439
>
> RunID: locallog
> Progress:
> Progress: Selecting site:198 Initializing site shared directory:1 Stage in:1
> Progress: Selecting site:1 Submitting:198 Submitted:1
> Progress: Selecting site:1 Submitted:198 Active:1
> Progress: Selecting site:1 Submitted:192 Active:7
> Progress: Selecting site:1 Submitted:188 Active:11
> Progress: Selecting site:1 Submitted:181 Active:18
> Progress: Selecting site:1 Submitted:178 Active:21
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
>
>
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:35 Checking status:1
> Progress: Submitted:156 Active:35 Checking status:1 Finished successfully:8
> Progress: Submitted:149 Active:34 Checking status:1 Finished successfully:16
> Progress: Submitted:144 Active:35 Checking status:1 Finished successfully:20
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:133 Active:31 Finished successfully:36
> Failed to transfer wrapper log from 066-many-locallog/info/0 on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/l on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/k on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/n on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/o on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/q on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/c on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/m on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/i on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/p on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/a on localhost
> Progress: Stage in:11 Submitting:34 Submitted:113 Active:6
> Finished successfully:36
> Progress: Submitted:157 Active:7 Finished successfully:36
> Failed to transfer wrapper log from 066-many-locallog/info/t on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/u on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/v on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/x on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/r on localhost
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> ...
> ... (not yet finished)
>
> $ grep JOB_SUBMISSION coasters.log | grep Active | grep workerid | cat -n | tail
>   65  2009-07-23 11:08:10,065-0500 DEBUG TaskImpl
> Task(type=JOB_SUBMISSION,
> identity=urn:1248364974288-1248364979260-1248364979261) setting status
> to Active workerid=000055
>   66  2009-07-23 11:08:10,090-0500 DEBUG TaskImpl
> Task(type=JOB_SUBMISSION,
> identity=urn:1248364974280-1248364979248-1248364979249) setting status
> to Active workerid=000051
> $ grep -a SUBMITJOB worker-0723-021156-00000* | grep Cmd | cat -n | tail
>   61  worker-0723-021156-000001.log:1248365290 000054 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>   62  worker-0723-021156-000001.log:1248365290 000050 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>   63  worker-0723-021156-000001.log:1248365290 000053 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>   64  worker-0723-021156-000001.log:1248365290 000052 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>   65  worker-0723-021156-000001.log:1248365290 000051 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>   66  worker-0723-021156-000001.log:1248365290 000055 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>
>
> It corresponds correctly with the swift session (more or less), since
> we had 30+ completed jobs.
>
> Some lines in coasters.log I find interesting:
> 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1248364974290-1248364979263-1248364979264) setting status
> to Submitted
> 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1248364974290-1248364979263-1248364979264) setting status
> to Active
> 2009-07-23 11:12:06,065-0500 INFO ?Command Sending Command(106,
> JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
> 2009-07-23 11:12:06,065-0500 INFO ?Command Command(106, JOBSTATUS)
> CMD: Command(106, JOBSTATUS)
> 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1248364974290-1248364979263-1248364979264) setting status
> to Failed Block task failed: 0723-021156-000001 Block task ended
> prematurely
>
> Statement unlikely to be reached at
> /home/aespinosa/.globus/coasters/cscript15423.pl line 580.
> ? ? ? ?(Maybe you meant system() when you said exec()?)
>
>
> 2009-07-23 11:12:06,065-0500 INFO ?Command Sending Command(107,
> JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
> 2009-07-23 11:12:06,065-0500 INFO ?Command Command(107, JOBSTATUS)
> CMD: Command(107, JOBSTATUS)
>
>
> -Allan
>
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From hategan at mcs.anl.gov Tue Jul 28 16:12:22 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 28 Jul 2009 16:12:22 -0500
Subject: [Swift-devel] Re: coaster workers not receiving enough jobs
In-Reply-To: <50b07b4b0907281406v238f95a0p52c58a65f98ef0af@mail.gmail.com>
References: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com>
<50b07b4b0907281406v238f95a0p52c58a65f98ef0af@mail.gmail.com>
Message-ID: <1248815542.4106.2.camel@localhost>
Right. I put in some quick code to prevent another spin. That likely has
some bugs that cause this problem.
On Tue, 2009-07-28 at 16:06 -0500, Allan Espinosa wrote:
> Now I tried changing 5 mins to 5 secs. Based on the worker ids in the
> pull() calls, it looks like all the jobs were successfully assigned to
> all workers throughout the blocks.
>
> [aespinosa at communicado coasters]$ grep pull coasters.log | grep -v
> Later | awk '{print $5}' | sort -u | cat -n | tail
> 65 0728-590322-000001:58
> 66 0728-590322-000001:59
> 67 0728-590322-000001:6
> 68 0728-590322-000001:60
> 69 0728-590322-000001:61
> 70 0728-590322-000001:62
> 71 0728-590322-000001:63
> 72 0728-590322-000001:7
> 73 0728-590322-000001:8
> 74 0728-590322-000001:9
>
>
> My guess is that with long jobs (5 mins), pull() times out while waiting
> and jobs only get assigned much later on. But this doesn't happen,
> because of some timeout mechanisms too (I think).
>
> 2009/7/23 Allan Espinosa :
> > I tried 0660-many.swift with 200 5min sleep jobs using local:local
> > mode (since queue on ranger and teraport takes a while to finish).
> > The session spawned 192 workers. Swift reports at most 36 active
> > processes at a time (which it finished successfully). After that
> > workers reach idle time exceptions. Logs and stuff are in
> > ~aespinosa/workflows/coaster_debug/run1/
> >
> > sites.xml (markup stripped by the list archiver; only the values survive):
> >
> >   /home/aespinosa/workflows/coaster_debug/workdir
> >   10000
> >   1.98
> >   1
> >   00:05:00
> >   3600
> >
> > swift session:
> > Swift svn swift-r3011 cog-r2439
> >
> > RunID: locallog
> > Progress:
> > Progress: Selecting site:198 Initializing site shared directory:1 Stage in:1
> > Progress: Selecting site:1 Submitting:198 Submitted:1
> > Progress: Selecting site:1 Submitted:198 Active:1
> > Progress: Selecting site:1 Submitted:192 Active:7
> > Progress: Selecting site:1 Submitted:188 Active:11
> > Progress: Selecting site:1 Submitted:181 Active:18
> > Progress: Selecting site:1 Submitted:178 Active:21
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> >
> >
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:35 Checking status:1
> > Progress: Submitted:156 Active:35 Checking status:1 Finished successfully:8
> > Progress: Submitted:149 Active:34 Checking status:1 Finished successfully:16
> > Progress: Submitted:144 Active:35 Checking status:1 Finished successfully:20
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:133 Active:31 Finished successfully:36
> > Failed to transfer wrapper log from 066-many-locallog/info/0 on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/l on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/k on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/n on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/o on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/q on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/c on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/m on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/i on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/p on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/a on localhost
> > Progress: Stage in:11 Submitting:34 Submitted:113 Active:6
> > Finished successfully:36
> > Progress: Submitted:157 Active:7 Finished successfully:36
> > Failed to transfer wrapper log from 066-many-locallog/info/t on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/u on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/v on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/x on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/r on localhost
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > ...
> > ... (not yet finished)
> >
> > $grep JOB_SUBMISSION coasters.log | grep Active | grep workerid | cat -n | tail
> > 65 2009-07-23 11:08:10,065-0500 DEBUG TaskImpl
> > Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974288-1248364979260-1248364979261) setting status
> > to Active workerid=000055
> > 66 2009-07-23 11:08:10,090-0500 DEBUG TaskImpl
> > Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974280-1248364979248-1248364979249) setting status
> > to Active workerid=000051
> > $ grep -a SUBMITJOB worker-0723-021156-00000* | grep Cmd | cat -n | tail
> > 61 worker-0723-021156-000001.log:1248365290 000054 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> > 62 worker-0723-021156-000001.log:1248365290 000050 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> > 63 worker-0723-021156-000001.log:1248365290 000053 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> > 64 worker-0723-021156-000001.log:1248365290 000052 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> > 65 worker-0723-021156-000001.log:1248365290 000051 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> > 66 worker-0723-021156-000001.log:1248365290 000055 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> >
> >
> > It corresponds correctly with the swift session (more or less), since
> > we had 30+ completed jobs.
> >
> > Some lines in coasters.log I find interesting:
> > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974290-1248364979263-1248364979264) setting status
> > to Submitted
> > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974290-1248364979263-1248364979264) setting status
> > to Active
> > 2009-07-23 11:12:06,065-0500 INFO Command Sending Command(106,
> > JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
> > 2009-07-23 11:12:06,065-0500 INFO Command Command(106, JOBSTATUS)
> > CMD: Command(106, JOBSTATUS)
> > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974290-1248364979263-1248364979264) setting status
> > to Failed Block task failed: 0723-021156-000001 Block task ended
> > prematurely
> >
> > Statement unlikely to be reached at
> > /home/aespinosa/.globus/coasters/cscript15423.pl line 580.
> > (Maybe you meant system() when you said exec()?)
> >
> >
> > 2009-07-23 11:12:06,065-0500 INFO Command Sending Command(107,
> > JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
> > 2009-07-23 11:12:06,065-0500 INFO Command Command(107, JOBSTATUS)
> > CMD: Command(107, JOBSTATUS)
> >
> >
> > -Allan
> >
>
>
>
From wilde at mcs.anl.gov Tue Jul 28 19:26:15 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Jul 2009 19:26:15 -0500
Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2
In-Reply-To: <6701C4A1-FC0D-4DA4-9972-2F9CEC8EF11D@mcs.anl.gov>
References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov>
<4A663FD8.3050909@mcs.anl.gov>
<4A6F1FB5.2080207@mcs.anl.gov>
<6701C4A1-FC0D-4DA4-9972-2F9CEC8EF11D@mcs.anl.gov>
Message-ID: <4A6F9727.9050300@mcs.anl.gov>
Stu,
Glen Hocky has started testing a protein folding app called "OOPS" on
QueenBee under GRAM5. Initial tiny sanity tests look good; we'll move on
to running 100+ job runs, then larger.
We needed to figure out how to get Swift to use all 8 cores of the
QueenBee compute nodes, which we did.
Now we can start scaling up. Glen hopes to test more there shortly.
So far, no problems; no observed differences (in interface) with the new
GRAM.
Any chance of getting GRAM5 on the firefly host at UNL?
- Mike
On 7/28/09 11:17 AM, Stuart Martin wrote:
> On Jul 28, 2009, at Jul 28, 10:56 AM, Michael Wilde wrote:
>
>> Allan Espinosa will try to test AMPL workflows for the SEE project
>> there this week.
>>
>> I may try a few others time permitting, but likely not this week.
>>
>> Questions, Stu:
>> - do you want testing through Condor-G with the grid_monitor as well
>> as native?
>
> I'd say to use GRAM5 as is best for you/your users. We've done some
> condor-g testing with and without the grid-monitor. We did with, just
> for backward compatibility. But without is recommended. The
> grid-monitor is no longer needed with GRAM5.
>
> So, if you have users that use condor-g, then submit GRAM5 jobs with
> that. But, turn off using the grid-monitor.
> http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_7:_gram5-condor-g
>
>
> But if it is "better" to submit them natively, through cog API I
> assume(?), then do that.
>
>> - for native testing of GRAM5 (ie through the plain pre-WS GRAM
>> interface) are there any guidelines for how many jobs we can safely
>> submit at once, or should we not worry about limits? (ie sending a few
>> thousand jobs is OK?)
>
> Don't worry about it and submit away. We need to know the
> limits/breaking points.
>
> But, to show what we've done in our testing, here are the results from
> our 5 client tests (each running in a separate VM) hitting the same
> GRAM5 service.
> http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_4:_5-client-seg_2
>
> http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_5:_5-client-seg-diffusers_2
>
>
> They submitted 5000 jobs over a 1 hour window to the same GRAM5
> service. The load on the head node never went above 4 on the first and
> 7 on the second.
>
>>
>> Allan: I just remembered that since Queenbee has 8-core hosts like
>> Abe, coasters is the only reasonable approach for large-scale testing.
>> But testing just a few AMPL jobs through plain GRAM5 seems a
>> reasonable step to do first.
>>
>> I realize that coaster testing, also, won't give good CPU utilization
>> until the current "low demand" problem is solved.
>>
>> - Mike
>>
>>
>> On 7/28/09 9:26 AM, Stuart Martin wrote:
>>> Hi Mike,
>>> Just following up on this. Will there be some swift use of GRAM5 on
>>> queen bee this week?
>>> -Stu
>>> On Jul 21, 2009, at Jul 21, 5:23 PM, Michael Wilde wrote:
>>>> Yes, there are a few we can run on QueenBee.
>>>>
>>>> Can try to test next week.
>>>>
>>>> Allan, we can test SEE/AMPL, OOPS, and PTMap there.
>>>>
>>>> - Mike
>>>>
>>>>
>>>> On 7/21/09 10:58 AM, Stuart Martin wrote:
>>>>> Are there any swift apps that can use queen bee? There is a GRAM5
>>>>> service setup there for testing.
>>>>> -Stu
>>>>> Begin forwarded message:
>>>>>> From: Stuart Martin
>>>>>> Date: July 21, 2009 10:56:04 AM CDT
>>>>>> To: gateways at teragrid.org
>>>>>> Cc: Stuart Martin , Lukasz Lacinski
>>>>>>
>>>>>> Subject: Fwd: [gram-user] GRAM5 Alpha2
>>>>>>
>>>>>> Hi Gateways,
>>>>>>
>>>>>> Any gateways that use (or can use) Queen Bee, it would be great if
>>>>>> you could target this new GRAM5 service that Lukasz deployed. I
>>>>>> heard from Lukasz that Jim has submitted a gateway user (SAML) job
>>>>>> and that went through fine and populated the gram audit DB
>>>>>> correctly. Thanks Jim! It would be nice to have some gateway
>>>>>> push the service to test scalability.
>>>>>>
>>>>>> Let us know if you plan to do this.
>>>>>>
>>>>>> Thanks,
>>>>>> Stu
>>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>>> From: Lukasz Lacinski
>>>>>>> Date: July 21, 2009 1:18:05 AM CDT
>>>>>>> To: gram-user at lists.globus.org
>>>>>>> Subject: [gram-user] GRAM5 Alpha2
>>>>>>>
>>>>>>> I've installed GRAM5 Alpha2 on Queen Bee.
>>>>>>>
>>>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
>>>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs
>>>>>>>
>>>>>>> -seg-module pbs works fine.
>>>>>>> GRAM audit with PostgreSQL works fine.
>>>>>>>
>>>>>>> Can someone submit jobs as a gateway user? I'd like to check if
>>>>>>> the gateway_user field is written to our audit database.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lukasz
>>>>>>
>>>>> _______________________________________________
>>>>> Swift-devel mailing list
>>>>> Swift-devel at ci.uchicago.edu
>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From wilde at mcs.anl.gov Tue Jul 28 19:47:50 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Jul 2009 19:47:50 -0500
Subject: [Swift-devel] Running on multicore hosts
Message-ID: <4A6F9C36.4090209@mcs.anl.gov>
Tibi,
You should be able to do some preliminary tests of your econ app on
QueenBee using GRAM5.
The GRAM contact URIs Stu posted were:
queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs
To use all 8 cores of the hosts, turn on Swift clustering.
Then edit libexec/_swiftseq to run all the jobs in a cluster in parallel
rather than serially.
1) add an & to the line where the jobs are exec'ed:
"$EXEC" "${ARGS[@]}" &
2) add a wait at the end of the script:
done
wait
echo `date +%s` DONE >> $WRAPPERLOG
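Taken together, the two edits amount to a loop like the following (a minimal standalone sketch with placeholder job commands, not the real _swiftseq; the real script appends DONE to $WRAPPERLOG). Waiting on each PID separately also shows one way per-job exit codes could be preserved:

```shell
# Sketch of the parallelized cluster loop. The three job commands
# below are placeholders for the cluster's actual jobs.
pids=""
for cmd in true true false; do
    $cmd &              # edit (1): background each job in the cluster
    pids="$pids $!"
done
failed=0
for pid in $pids; do    # edit (2): wait, but per PID, keeping each
    wait "$pid" || failed=$((failed + 1))   # job's exit status
done
echo "$(date +%s) DONE ($failed failed)"
```

With the placeholder jobs above (two succeed, one fails) the loop reports one failure, which is the kind of per-job accounting the serial cluster script currently loses.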
Then turn on clustering. You need to do the math to get a fixed cluster
size of NCPUs: 8 for QueenBee and Abe, 16 for Ranger.
For oops we used:
clustering.enabled=true
clustering.min.time=480
clustering.queue.delay=15
with a GLOBUS::maxwalltime="00:01:00"
This gave clusters of 480/60 = 8, and PBS walltimes of 8 minutes.
To note:
- the site maxwalltime was ignored; Swift calculated the PBS maxwalltime
from the cluster size it built.
- contrary to the user guide, Swift seemed to use
clustering.min.time/(tc.data time)
rather than
(2*clustering.min.time)/(tc.data time)
That needs investigation; it may be a matter of interpretation or may be
describing a case where more jobs could enter the cluster queue before
Swift has a chance to close the cluster.
- When we are more sure this works, we can commit a reference file
_swiftpar to the libexec directory.
- at the moment the simple hack punts on per-job error code return for
the cluster. The sequential cluster script passes on the error code of
the first job in the cluster to fail, and aborts the rest of the
cluster. The hack above treats the cluster as if all jobs succeeded. I'm
not sure if the per-job error codes make it back via _swiftwrap; if not,
they could be made to.
In any case, this is at the moment a temporary but simple hack to use
sites with multicore nodes, while coasters is being debugged.
It could readily be generalized though into straightforward direct
support for multicore hosts over GRAM5, PBS, or Condor-G.
- Mike
From smartin at mcs.anl.gov Wed Jul 29 10:25:54 2009
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Wed, 29 Jul 2009 10:25:54 -0500
Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2
In-Reply-To: <4A6F9727.9050300@mcs.anl.gov>
References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov>
<4A663FD8.3050909@mcs.anl.gov>
<4A6F1FB5.2080207@mcs.anl.gov>
<6701C4A1-FC0D-4DA4-9972-2F9CEC8EF11D@mcs.anl.gov>
<4A6F9727.9050300@mcs.anl.gov>
Message-ID: <367326D5-0888-4F1E-B0CB-5979EB4A3B0E@mcs.anl.gov>
On Jul 28, 2009, at Jul 28, 7:26 PM, Michael Wilde wrote:
> Stu,
>
> Glen Hocky has started testing a protein folding app called "OOPS"
> on QueenBee under GRAM5. Initial tiny sanity tests look good; we'll
> move on to running 100+ job runs, then larger.
>
> We needed to figure out how to get Swift to use all 8 cores of the
> QueenBee compute nodes, which we did.
>
> Now we can start scaling up. Glen hopes to test more there shortly.
>
> So far, no problems; no observed differences (in interface) with the
> new GRAM.
Cool. Let's see how things go as you ramp up. I want to keep track
of GRAM5 application use cases and test results as I get them.
I took a stab at what I think is happening for this OOPS application.
I'm not sure if it is accurate; please take a look.
http://dev.globus.org/wiki/GRAM/GRAM5#Application_Testing
Then I'll need the details of one of the larger test runs that is done.
>
> Any chance of getting GRAM5 on the firefly host at UNL?
Yea, I (we) can ask. But, maybe it makes sense to get some success
with using gram5 on queen bee first and then we ask Brian and say
look, we'd like/need gram5 installed there for testing? Looks like
Firefly is running moab, so it would use the gram pbs adapter like
queen bee.
There is a CMS OSG effort going on now with Igor and Jeff Dost. But
regardless, the more testing/deployments the better.
> - Mike
>
>
> On 7/28/09 11:17 AM, Stuart Martin wrote:
>> On Jul 28, 2009, at Jul 28, 10:56 AM, Michael Wilde wrote:
>>> Allan Espinosa will try to test AMPL workflows for the SEE project
>>> there this week.
>>>
>>> I may try a few others time permitting, but likely not this week.
>>>
>>> Questions, Stu:
>>> - do you want testing through Condor-G with the grid_monitor as
>>> well as native?
>> I'd say to use GRAM5 as is best for you/your users. We've done
>> some condor-g testing with and without the grid-monitor. We did
>> with, just for backward compatibility. But without is
>> recommended. The grid-monitor is no longer needed with GRAM5.
>> So, if you have users that use condor-g, then submit GRAM5 jobs
>> with that. But, turn off using the grid-monitor.
>> http://dev.globus.org/wiki/
>> GRAM5_Scalability_Results#Test_7:_gram5-condor-g But if it is
>> "better" to submit them natively, through cog API I assume(?), then
>> do that.
>>> - for native testing of GRAM5 (ie through the plain pre-WS GRAM
>>> interface) are there any guidelines for how many jobs we can safely
>>> submit at once, or should we not worry about limits? (ie sending a
>>> few thousand jobs is OK?)
>> Don't worry about it and submit away. We need to know the limits/
>> breaking points.
>> But, to show what we've done in our testing, here are the results
>> from our 5 client tests (each running in a separate VM) hitting the
>> same GRAM5 service.
>> http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_4:_5-client-seg_2
>> http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_5:_5-client-seg-diffusers_2
>> They submitted 5000 jobs over a 1-hour window to the same GRAM5
>> service. The load on the head node never went above 4 on the first
>> and 7 on the second.
>>>
>>> Allan: I just remembered that since Queenbee has 8-core hosts like
>>> Abe, coasters is the only reasonable approach for large-scale
>>> testing. But testing just a few AMPL jobs through plain GRAM5
>>> seems a reasonable step to do first.
>>>
>>> I realize that coaster testing, also, won't give good CPU
>>> utilization until the current "low demand" problem is solved.
>>>
>>> - Mike
>>>
>>>
>>> On 7/28/09 9:26 AM, Stuart Martin wrote:
>>>> Hi Mike,
>>>> Just following up on this. Will there be some swift use of GRAM5
>>>> on queen bee this week?
>>>> -Stu
>>>> On Jul 21, 2009, at Jul 21, 5:23 PM, Michael Wilde wrote:
>>>>> Yes, there are a few we can run on QueenBee.
>>>>>
>>>>> Can try to test next week.
>>>>>
>>>>> Allan, we can test SEE/AMPL, OOPS, and PTMap there.
>>>>>
>>>>> - Mike
>>>>>
>>>>>
>>>>> On 7/21/09 10:58 AM, Stuart Martin wrote:
>>>>>> Are there any swift apps that can use queen bee? There is a
>>>>>> GRAM5 service setup there for testing.
>>>>>> -Stu
>>>>>> Begin forwarded message:
>>>>>>> From: Stuart Martin
>>>>>>> Date: July 21, 2009 10:56:04 AM CDT
>>>>>>> To: gateways at teragrid.org
>>>>>>> Cc: Stuart Martin, Lukasz Lacinski
>>>>>>> Subject: Fwd: [gram-user] GRAM5 Alpha2
>>>>>>>
>>>>>>> Hi Gateways,
>>>>>>>
>>>>>>> Any gateways that use (or can use) Queen Bee, it would be
>>>>>>> great if you could target this new GRAM5 service that Lukasz
>>>>>>> deployed. I heard from Lukasz that Jim has submitted a
>>>>>>> gateway user (SAML) job, that it went through fine, and that
>>>>>>> it populated the GRAM audit DB correctly. Thanks Jim! It would
>>>>>>> be nice to have some gateway push the service to test
>>>>>>> scalability.
>>>>>>>
>>>>>>> Let us know if you plan to do this.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Stu
>>>>>>>
>>>>>>> Begin forwarded message:
>>>>>>>
>>>>>>>> From: Lukasz Lacinski
>>>>>>>> Date: July 21, 2009 1:18:05 AM CDT
>>>>>>>> To: gram-user at lists.globus.org
>>>>>>>> Subject: [gram-user] GRAM5 Alpha2
>>>>>>>>
>>>>>>>> I've installed GRAM5 Alpha2 on Queen Bee.
>>>>>>>>
>>>>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
>>>>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs
>>>>>>>>
>>>>>>>> -seg-module pbs works fine.
>>>>>>>> GRAM audit with PostgreSQL works fine.
>>>>>>>>
>>>>>>>> Can someone submit jobs as a gateway user? I'd like to check
>>>>>>>> if the gateway_user field is written to our audit database.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Lukasz
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Swift-devel mailing list
>>>>>> Swift-devel at ci.uchicago.edu
>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From jamalphd at gmail.com Sun Jul 26 14:50:09 2009
From: jamalphd at gmail.com (J A)
Date: Sun, 26 Jul 2009 19:50:09 -0000
Subject: [Swift-devel] XDTM
Message-ID:
Hi All:
Can anyone direct me to a source with more examples/explanation of how
XDTM works and is implemented?
Thanks,
Jamal
From benc at hawaga.org.uk Mon Jul 27 11:27:51 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 27 Jul 2009 16:27:51 -0000
Subject: [Swift-devel] [provenance-challenge] FGCS Special Issue on Using
the Open Provenance
Model to Address Interoperability Challenges (fwd)
Message-ID:
This went to the provenance challenge list - maybe someone is interested.
---------- Forwarded message ----------
Date: Mon, 20 Jul 2009 20:06:01 +0000
From: Yogesh Simmhan
Reply-To: provenance-challenge at ipaw.info
To: "provenance-challenge at ipaw.info"
Subject: [provenance-challenge] FGCS Special Issue on Using the Open Provenance
Model to Address Interoperability Challenges
This is the CfP for the special issue on OPM we discussed at the PC3 workshop. The special issue will appear in J. FGCS. Please consider submitting articles to the issue and also forward the CfP to groups/people who may be interested. PDF/TXT/HTML versions are attached.
Regards,
--Yogesh
________________________________________________________
Yogesh Simmhan/Post Doc Researcher/eScience Group/Microsoft Research
EMail: yoges at microsoft.com WWW: research.microsoft.com/~yoges
Office (LA): 1100 Glendon Ave/Suite 1080, Los Angeles CA 90024
Cell: +1 (540) 449-4770 SF Desk/Fax: +1 (425) 538-6245
-------------- next part --------------
A non-text attachment was scrubbed...
Name: FGCS-OPM-CfP.PDF
Type: application/pdf
Size: 56343 bytes
Desc: FGCS-OPM-CfP.PDF
URL:
From benc at hawaga.org.uk Tue Jul 28 08:35:19 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 28 Jul 2009 13:35:19 -0000
Subject: [Swift-devel] Re: More questions on Provenance
In-Reply-To: <4A6DEEDF.6050603@purdue.edu>
References: <4A6DEEDF.6050603@purdue.edu>
Message-ID:
Hi Tanu. I'm long gone. But here are a few brief comments. I added
swift-devel.
On Mon, 27 Jul 2009, Tanu Malik wrote:
> 1. How do you model the provenance for across-the-network transfers?
> In that case the input is some file, the process is the file-transfer
> process, and the output would be on another machine. The output will
> have to be created manually, recording either the success or failure
> of the transfer.
The level at which provenance is recorded is more abstract than the
level at which file transfers exist. A procedure takes input files,
described by URLs relative to the submit-side run directory, and
produces output files described the same way.
The internal mechanics of moving those files to runtime sites as needed,
and of managing the cache of them, happen internally to procedure
execution and are not exposed as explicit activity.
Information is logged about such transfers, though, so if desired it
might be possible to make another level of description about what
happened there (one of the interesting things in ongoing OPM work is how
to describe the same activity at multiple levels like this).
> 2. Also you mention something about the number of runs in your
> presentation: "extra records ≈ depth of graph x number of runs". What
> does the number of runs correspond to and how is that modeled in the DB?
This is about constructing an explicit transitive closure of the
procedure/dataset graph.
If you have an explicit graph A->B, B->C then constructing the closure
means you need to add A->C as an edge. That's what I mean by roughly
proportional to depth of graph: the deeper the graph, the more edges you
need to add.
In the most recent implementation, each invocation of Swift is a subgraph
disconnected from the subgraphs of all other invocations of Swift. So (if
you make the often-invalid but also often-valid assumption that each
invocation of Swift generates roughly the same amount of provenance
output), the size of the graph as a whole is roughly proportional to the
number of runs.
If further work was done to identify datasets across the graphs of
different runs (using some identity relation such as same filename or
something else), then generating a transitive closure could generate
graphs proportional to more than the number of runs.
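
[Editor's note: to make the closure construction concrete, here is a
small sketch in plain Python (not Swift's actual provenance code; the
edge representation is illustrative) of adding the implied edges to an
explicit procedure/dataset graph.]

```python
from collections import defaultdict

def transitive_closure(edges):
    """Return the transitive closure of a DAG given as (src, dst) pairs.

    If A->B and B->C are present, A->C is added, and so on until no
    more implied edges remain.
    """
    succ = defaultdict(set)
    for a, b in edges:
        succ[a].add(b)
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c in succ[b]:
                if (a, c) not in closure:
                    closure.add((a, c))
                    succ[a].add(c)
                    changed = True
    return closure

# A->B, B->C forces the extra edge A->C:
print(sorted(transitive_closure({("A", "B"), ("B", "C")})))
# → [('A', 'B'), ('A', 'C'), ('B', 'C')]

# A deeper chain adds proportionally more implied edges (10 for a
# 5-node chain), illustrating "extra records grow with graph depth":
print(len(transitive_closure({("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")})))
# → 10
```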
> I was also wondering if we can chat on the phone or I come up again to
> discuss a possible collaboration on this project and present some of our
> new results.
Nothing involving me except by very occasional email or if you hunt me
down in person and ply me with alcohol.
--
From tmalik at purdue.edu Tue Jul 28 11:12:07 2009
From: tmalik at purdue.edu (Tanu Malik)
Date: Tue, 28 Jul 2009 16:12:07 -0000
Subject: [Swift-devel] Re: More questions on Provenance
In-Reply-To:
References: <4A6DEEDF.6050603@purdue.edu>
Message-ID: <4A6F201F.3010205@purdue.edu>
Thanks Ben,
This is very helpful. I wish I could hunt you down.
Interesting to know about the recent OPM work.
We have defined network nodes in our model to explicitly demonstrate those.
I did not know about OPM.
Thanks
Ben Clifford wrote:
> Hi Tanu. I'm long gone. But here are a few brief comments. I added
> swift-devel.
>
> On Mon, 27 Jul 2009, Tanu Malik wrote:
>
>
>> 1. How do you model the provenance for across-the-network transfers?
>> In that case the input is some file, the process is the file-transfer
>> process, and the output would be on another machine. The output will
>> have to be created manually, recording either the success or failure
>> of the transfer.
>>
>
> The level at which provenance is recorded is more abstract than the
> level at which file transfers exist. A procedure takes input files,
> described by URLs relative to the submit-side run directory, and
> produces output files described the same way.
>
> The internal mechanics of moving those files to runtime sites as
> needed, and of managing the cache of them, happen internally to
> procedure execution and are not exposed as explicit activity.
>
> Information is logged about such transfers, though, so if desired it
> might be possible to make another level of description about what
> happened there (one of the interesting things in ongoing OPM work is
> how to describe the same activity at multiple levels like this).
>
>
>> 2. Also you mention something about the number of runs in your
>> presentation: "extra records ≈ depth of graph x number of runs". What
>> does the number of runs correspond to and how is that modeled in the DB?
>>
>
> This is about constructing an explicit transitive closure of the
> procedure/dataset graph.
>
> If you have an explicit graph A->B, B->C then constructing the closure
> means you need to add A->C as an edge. That's what I mean by roughly
> proportional to depth of graph: the deeper the graph, the more edges
> you need to add.
>
> In the most recent implementation, each invocation of Swift is a subgraph
> disconnected from the subgraphs of all other invocations of Swift. So (if
> you make the often-invalid but also often-valid assumption that each
> invocation of Swift generates roughly the same amount of provenance
> output), the size of the graph as a whole is roughly proportional to
> the number of runs.
>
> If further work was done to identify datasets across the graphs of
> different runs (using some identity relation such as same filename or
> something else), then generating a transitive closure could generate
> graphs proportional to more than the number of runs.
>
>
>> I was also wondering if we can chat on the phone or I come up again to
>> discuss a possible collaboration on this project and present some of our
>> new results.
>>
>
> Nothing involving me except by very occasional email or if you hunt me
> down in person and ply me with alcohol.
>
> --