From yizhu at cs.uchicago.edu  Wed Jul  1 04:33:58 2009
From: yizhu at cs.uchicago.edu (yizhu)
Date: Wed, 01 Jul 2009 04:33:58 -0500
Subject: [Swift-devel] swift error ( gridftp problem)
Message-ID: <4A4B2D86.6050608@cs.uchicago.edu>

Hi,

I have a problem when trying to run Swift on Amazon EC2 from Swift on my local computer.

The EC2 cluster is configured as a Globus-installed PBS cluster with one head node and several worker nodes, and shares the /home/ directory via NFS. I've used SimpleCA to create credentials for both the head node (host certificate) and the user (user certificate).

After getting SimpleCA working, I finally got rid of the "Authentication Failure" when running Swift, but a new problem occurs: it gets stuck on "Progress: Initializing site shared directory:1" and finally fails after several tries. After that, I checked the Swift work directory and found that a new directory had been created with a 0-byte file "_swiftwrap".

I also tried running globus-url-copy on the client side; it failed, with the file name created at the remote site but with 0-byte size. It seems that GridFTP can successfully create the directory and file name, but cannot actually transfer the data.

For the firewall settings on EC2, I opened tcp/udp 2119 (gridftp), tcp/udp 2811 (gram2), tcp/udp 8443 (gram4), plus ssh, https, and http.

-Yi

[1] Swift failed

-bash-3.2$ swift -tc.file ../tc.test.data -sites.file ../sites.test.xml first.swift
Swift 0.9 swift-r2860 cog-r2388

RunID: 20090701-0344-zn2a66ub
Progress:
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Failed:1
Execution failed:
        Could not initialize shared directory on ec2_basecluster
Caused by:
        Reply wait timeout. (error code 4)
-bash-3.2$

[2] GridFTP failed

-bash-3.2$ globus-url-copy file:////home/yizhu/firstswift/hello.txt gsiftp://ec2-174-129-90-225.compute-1.amazonaws.com/rec_data.txt
-bash-3.2$ globus-url-copy file:////home/yizhu/firstswift/hello.txt gsiftp://ec2-174-129-90-225.compute-1.amazonaws.com:2811/home/torqueuser/rec_data.txt

GlobusUrlCopy error: UrlCopy transfer failed. [Caused by: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply:
500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_recv:1770:
500-globus_l_gfs_file_open failed.
500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
500-globus_xio_register_open failed.
500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
500-Unable to open file /home/torqueuser/home/torqueuser/rec_data.txt
500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
500-System error in open: No such file or directory
500-globus_xio: A system call failed: No such file or directory
500 End.]]
-bash-3.2$
-bash-3.2$

[3] -bash-3.2$ cat tc.test.data

...
...
ec2_basecluster  echo   /bin/echo   INSTALLED  INTEL32::LINUX  null
ec2_basecluster  cat    /bin/cat    INSTALLED  INTEL32::LINUX  null
ec2_basecluster  ls     /bin/ls     INSTALLED  INTEL32::LINUX  null
ec2_basecluster  grep   /bin/grep   INSTALLED  INTEL32::LINUX  null
ec2_basecluster  sort   /bin/sort   INSTALLED  INTEL32::LINUX  null
ec2_basecluster  paste  /bin/paste  INSTALLED  INTEL32::LINUX  null
ec2_basecluster  wc     /bin/wc     INSTALLED  INTEL32::LINUX  null
ec2_basecluster  touch  /bin/touch  INSTALLED  INTEL32::LINUX  null
ec2_basecluster  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  null
...
...

[4] -bash-3.2$ cat sites.test.xml

...
/home/torqueuser
...
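The XML markup in section [4] did not survive the list archive. For reference, a Swift 0.9 sites.xml pool entry for this kind of GT2/PBS gatekeeper with a GridFTP server generally looks roughly like the sketch below; the handle, host name, and work directory come from the fragments visible in this thread, while the remaining element contents are illustrative rather than the poster's actual file:

<pool handle="ec2_basecluster">
  <gridftp url="gsiftp://ec2-174-129-90-225.compute-1.amazonaws.com" />
  <jobmanager universe="vanilla"
              url="ec2-174-129-90-225.compute-1.amazonaws.com/jobmanager-pbs" major="2" />
  <workdirectory>/home/torqueuser</workdirectory>
</pool>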
[5] debug version of swift run -bash-3.2$ swift -tc.file ../tc.test.data -sites.file ../sites.test.xml first.swift -debug Max heap: 268435456 kmlversion is >85d4b03e-7b73-49b7-81aa-096255181491< build version is >85d4b03e-7b73-49b7-81aa-096255181491< Recompilation suppressed. Stack dump: Level 1 [iA = 0, iB = 0, bA = false, bB = false] vdl:instanceconfig = Swift configuration [] vdl:operation = run vds.home = /home/yizhu/swift-0.9/bin/.. Using sites file: ../sites.test.xml Using tc.data: ../tc.test.data Setting resources to: {ec2_basecluster=ec2_basecluster} Swift 0.9 swift-r2860 cog-r2388 Swift 0.9 swift-r2860 cog-r2388 RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090701-0348-vrb1yxl6 RunID: 20090701-0348-vrb1yxl6 closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000001 type string value=Hello, world! dataset=unnamed SwiftScript value (closed) ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000001 path=$ VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000001 VALUE=Hello, world! NEW id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000001 Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000002 type messagefile with no value at dataset=outfile (not closed).$ NEW id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000002 Progress: PROCEDURE line=3 thread=0 name=greeting PARAM thread=0 direction=output variable=t provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000002 closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000003 type string value=hello.txt dataset=unnamed SwiftScript value (closed) ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000003 path=$ VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000003 VALUE=hello.txt START thread=0 tr=echo Sorted: [ec2_basecluster:0.000(1.000):0/1 overload: 0] Rand: 0.8176156212454151, sum: 1.0 Next contact: ec2_basecluster:0.000(1.000):0/1 overload: 0 START host=ec2_basecluster - Initializing shared directory multiplyScore(ec2_basecluster:0.000(1.000):1/1 overload: 0, -0.01) Old score: 0.000, new score: -0.010 No global submit throttle set. Using default (100) Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) setting status to Submitting Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) setting status to Submitted Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) setting status to Active Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) setting status to Completed multiplyScore(ec2_basecluster:-0.010(0.994):1/1 overload: 0, 0.01) Old score: -0.010, new score: 0.000 multiplyScore(ec2_basecluster:0.000(1.000):1/1 overload: 0, 0.1) Old score: 0.000, new score: 0.100 Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) Completed. Waiting: 0, Running: 0. 
Heap size: 64M, Heap free: 30M, Max heap: 256M multiplyScore(ec2_basecluster:0.100(1.060):1/1 overload: 0, -0.2) Old score: 0.100, new score: -0.100 Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105285) setting status to Submitting Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105285) setting status to Submitted Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105285) setting status to Active Progress: Initializing site shared directory:1 Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105285) setting status to Failed null multiplyScore(ec2_basecluster:-0.100(0.943):1/1 overload: 0, -0.5) Old score: -0.100, new score: -0.600 Releasing contact 2 commitDelayedScore(ec2_basecluster:-0.600(0.705):0/1 overload: 0, 0.1 Sorted: [ec2_basecluster:-0.500(0.747):0/1 overload: 0] Rand: 0.4103224563240889, sum: 1.0 Next contact: ec2_basecluster:-0.500(0.747):0/1 overload: 0 Progress: Initializing site shared directory:1 START host=ec2_basecluster - Initializing shared directory multiplyScore(ec2_basecluster:-0.500(0.747):1/1 overload: -140, -0.01) Old score: -0.500, new score: -0.510 Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) setting status to Submitting Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) setting status to Submitted Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) setting status to Active Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) setting status to Completed multiplyScore(ec2_basecluster:-0.510(0.742):1/1 overload: 0, 0.01) Old score: -0.510, new score: -0.500 multiplyScore(ec2_basecluster:-0.500(0.747):1/1 overload: 0, 0.1) Old score: -0.500, new score: -0.400 Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) Completed. Waiting: 0, Running: 0. Heap size: 64M, Heap free: 28M, Max heap: 256M multiplyScore(ec2_basecluster:-0.400(0.791):1/1 overload: 0, -0.2) Old score: -0.400, new score: -0.600 Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105291) setting status to Submitting Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105291) setting status to Submitted Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105291) setting status to Active Progress: Initializing site shared directory:1 Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105291) setting status to Failed null multiplyScore(ec2_basecluster:-0.600(0.705):1/1 overload: 0, -0.5) Old score: -0.600, new score: -1.100 Releasing contact 3 commitDelayedScore(ec2_basecluster:-1.100(0.530):0/1 overload: 0, 0.1 Sorted: [ec2_basecluster:-1.000(0.561):0/1 overload: 0] Rand: 0.653323366777857, sum: 1.0 Next contact: ec2_basecluster:-1.000(0.561):0/1 overload: 0 Progress: Initializing site shared directory:1 START host=ec2_basecluster - Initializing shared directory multiplyScore(ec2_basecluster:-1.000(0.561):1/1 overload: -199, -0.01) Old score: -1.000, new score: -1.010 Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) setting status to Submitting Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) setting status to Submitted Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) setting status to Active Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) setting status to Completed multiplyScore(ec2_basecluster:-1.010(0.557):1/1 overload: 0, 0.01) Old score: -1.010, new score: -1.000 multiplyScore(ec2_basecluster:-1.000(0.561):1/1 overload: 0, 0.1) Old score: -1.000, new score: -0.900 Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) Completed. Waiting: 0, Running: 0. 
Heap size: 64M, Heap free: 27M, Max heap: 256M multiplyScore(ec2_basecluster:-0.900(0.593):1/1 overload: 0, -0.2) Old score: -0.900, new score: -1.100 Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105297) setting status to Submitting Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105297) setting status to Submitted Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105297) setting status to Active Progress: Initializing site shared directory:1 Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105297) setting status to Failed null multiplyScore(ec2_basecluster:-1.100(0.530):1/1 overload: 0, -0.5) Old score: -1.100, new score: -1.600 Releasing contact 4 commitDelayedScore(ec2_basecluster:-1.600(0.403):0/1 overload: 0, 0.1 END_FAILURE thread=0 tr=echo Progress: Failed:1 Could not initialize shared directory on ec2_basecluster Could not initialize shared directory on ec2_basecluster Caused by: null Caused by: org.globus.cog.abstraction.impl.file.IrrecoverableResourceException Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. (error code 4) at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) Caused by: null Caused by: org.globus.cog.abstraction.impl.file.IrrecoverableResourceException Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. 
(error code 4) at org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) at org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) at org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:151) at org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.taskFailed(AbstractGridNode.java:314) at org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) at org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) at org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:656) at org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:421) at org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) at org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236) at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224) at org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.transferFailed(DelegatedFileTransferHandler.java:581) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:505) at java.lang.Thread.run(Thread.java:595) Caused by: org.globus.cog.abstraction.impl.file.IrrecoverableResourceException at org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:44) at org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:33) at org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:430) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:355) at org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:492) ... 1 more Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. (error code 4) at org.globus.ftp.vanilla.FTPServerFacade$LocalControlChannel.waitFor(FTPServerFacade.java:511) at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:129) ... 1 more Execution failed: Could not initialize shared directory on ec2_basecluster Caused by: Reply wait timeout. (error code 4) Detailed exception: Could not initialize shared directory on ec2_basecluster Caused by: null Caused by: org.globus.cog.abstraction.impl.file.IrrecoverableResourceException Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. 
(error code 4) at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) Caused by: null Caused by: org.globus.cog.abstraction.impl.file.IrrecoverableResourceException Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. 
(error code 4) at org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) at org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) at org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:151) at org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.taskFailed(AbstractGridNode.java:314) at org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) at org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) at org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:656) at org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:421) at org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) at org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236) at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224) at org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.transferFailed(DelegatedFileTransferHandler.java:581) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:505) at java.lang.Thread.run(Thread.java:595) Caused by: org.globus.cog.abstraction.impl.file.IrrecoverableResourceException at org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:44) at org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:33) at org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:430) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:355) at org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:492) ... 1 more Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. (error code 4) at org.globus.ftp.vanilla.FTPServerFacade$LocalControlChannel.waitFor(FTPServerFacade.java:511) at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:129) ... 1 more Swift finished with errors -bash-3.2$ -bash-3.2$ From wilde at mcs.anl.gov Wed Jul 1 07:00:07 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 01 Jul 2009 07:00:07 -0500 Subject: [Swift-devel] swift error ( gridftp problem) In-Reply-To: <4A4B2D86.6050608@cs.uchicago.edu> References: <4A4B2D86.6050608@cs.uchicago.edu> Message-ID: <4A4B4FC7.5090607@mcs.anl.gov> Yi, I dont have an answer for you, but it certainly seems to be a problem at the GridFTP level, not a Swift problem. Do you have GLOBUS_TCP_PORT_RANGE and GLOBUS_TCP_SOURCE_RANGE set in your client environment (ie on the "local computer")? From that local computer, with an ordinary (e.g., DOEGrids or NCSA) certificate, can you access files on for example TeraPort? - Mike On 7/1/09 4:33 AM, yizhu wrote: > Hi, > > I have a problem when try running swift on Amazon EC2 with swift on > local computer. 
> > The EC2 is configured as a globus Installed PBS cluster with one head > node and several and shared the /home/ directory via NFS , i've use > simpleCA to create a credential for both headnode (host certificate) and > user (user certificate). > > after make simpleCA working, I finally get rid of "Authentication > Failure" when running swift, but a new problem occurs; it stuck on > "Progress: Initializing site shared directory:1" and finally failed > after several try. After that, I checked the "swift workdirectory" and > found that new directory has been created with a 0 byte file "_swiftwrap". > > I also tried run globus-url-copy on client side, it failed with the file > named created at remote site but with 0 byte size. It seems that gridftp > can successfully create the directory and filename, but can not actually > transfer the data. > > For the firewall setting on EC2, i opened tcp/udp 2119 (gridftp), > tcp/udp 2811(gram2), tcp/udp 8443 (gram4), (ssh), (https), (http). > > > -Yi > [1] Swift failed > -bash-3.2$ swift -tc.file ../tc.test.data -sites.file ../sites.test.xml > first.swift > Swift 0.9 swift-r2860 cog-r2388 > > RunID: 20090701-0344-zn2a66ub > Progress: > Progress: Initializing site shared directory:1 > Progress: Initializing site shared directory:1 > Progress: Initializing site shared directory:1 > Progress: Initializing site shared directory:1 > Progress: Initializing site shared directory:1 > Progress: Failed:1 > Execution failed: > Could not initialize shared directory on ec2_basecluster > Caused by: > Reply wait timeout. (error code 4) > -bash-3.2$ > > [2] Grid-ftp-failed > -bash-3.2$ globus-url-copy file:////home/yizhu/firstswift/hello.txt > gsiftp://ec2-174-129-90-225.compute-1.amazonaws.com/rec_data.txt > -bash-3.2$ globus-url-copy file:////home/yizhu/firstswift/hello.txt > gsiftp://ec2-174-129-90-225.compute-1.amazonaws.com:2811/home/torqueuser/rec_data.txt > > GlobusUrlCopy error: UrlCopy transfer failed. [Caused by: Server refused > performing the request. Custom message: (error code 1) [Nested > exception message: Custom message: Unexpected reply: 500-Command > failed. : globus_gridftp_server_file.c:globus_l_gfs_file_recv:1770: > 500-globus_l_gfs_file_open failed. > 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694: > 500-globus_xio_register_open failed. > 500-globus_xio_file_driver.c:globus_l_xio_file_open:438: > 500-Unable to open file /home/torqueuser/home/torqueuser/rec_data.txt > 500-globus_xio_file_driver.c:globus_l_xio_file_open:381: > 500-System error in open: No such file or directory > 500-globus_xio: A system call failed: No such file or directory > 500 End.]] > -bash-3.2$ > -bash-3.2$ > > > [3]-bash-3.2$ cat tc.test.data > > ... > ... > > ec2_basecluster echo /bin/echo INSTALLED > INTEL32::LINUX null > ec2_basecluster cat /bin/cat INSTALLED > INTEL32::LINUX null > ec2_basecluster ls /bin/ls INSTALLED > INTEL32::LINUX null > ec2_basecluster grep /bin/grep INSTALLED > INTEL32::LINUX null > ec2_basecluster sort /bin/sort INSTALLED > INTEL32::LINUX null > ec2_basecluster paste /bin/paste INSTALLED > INTEL32::LINUX null > ec2_basecluster wc /bin/wc INSTALLED > INTEL32::LINUX null > ec2_basecluster touch /bin/touch INSTALLED > INTEL32::LINUX null > ec2_basecluster sleep /bin/sleep INSTALLED > INTEL32::LINUX null > > ... > ... > > [4] -bash-3.2$ cat sites.test.xml > > > ... > > > > > url="ec2-174-129-90-225.compute-1.amazonaws.com/jobmanager-pbs" > major="2" /> > /home/torqueuser > > ... 
> > > [5] debug version of swift run > -bash-3.2$ swift -tc.file ../tc.test.data -sites.file ../sites.test.xml > first.swift -debug > Max heap: 268435456 > kmlversion is >85d4b03e-7b73-49b7-81aa-096255181491< > build version is >85d4b03e-7b73-49b7-81aa-096255181491< > Recompilation suppressed. > Stack dump: > Level 1 > [iA = 0, iB = 0, bA = false, bB = false] > vdl:instanceconfig = Swift configuration [] > vdl:operation = run > vds.home = /home/yizhu/swift-0.9/bin/.. > > > Using sites file: ../sites.test.xml > Using tc.data: ../tc.test.data > Setting resources to: {ec2_basecluster=ec2_basecluster} > Swift 0.9 swift-r2860 cog-r2388 > > Swift 0.9 swift-r2860 cog-r2388 > > RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090701-0348-vrb1yxl6 > RunID: 20090701-0348-vrb1yxl6 > closed org.griphyn.vdl.mapping.RootDataNode identifier > tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000001 > type string value=Hello, world! dataset=unnamed SwiftScript value (closed) > ROOTPATH > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000001 > path=$ > VALUE > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000001 > VALUE=Hello, world! > NEW > id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000001 > > Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier > tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000002 > type messagefile with no value at dataset=outfile (not closed).$ > NEW > id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000002 > > Progress: > PROCEDURE line=3 thread=0 name=greeting > PARAM thread=0 direction=output variable=t > provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000002 > > closed org.griphyn.vdl.mapping.RootDataNode identifier > tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000003 > type string value=hello.txt dataset=unnamed SwiftScript value (closed) > ROOTPATH > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000003 > path=$ > VALUE > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090701-0348-xecqa0zc:720000000003 > VALUE=hello.txt > START thread=0 tr=echo > Sorted: [ec2_basecluster:0.000(1.000):0/1 overload: 0] > Rand: 0.8176156212454151, sum: 1.0 > Next contact: ec2_basecluster:0.000(1.000):0/1 overload: 0 > START host=ec2_basecluster - Initializing shared directory > multiplyScore(ec2_basecluster:0.000(1.000):1/1 overload: 0, -0.01) > Old score: 0.000, new score: -0.010 > No global submit throttle set. Using default (100) > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) setting status > to Submitting > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) setting status > to Submitted > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) setting status > to Active > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) setting status > to Completed > multiplyScore(ec2_basecluster:-0.010(0.994):1/1 overload: 0, 0.01) > Old score: -0.010, new score: 0.000 > multiplyScore(ec2_basecluster:0.000(1.000):1/1 overload: 0, 0.1) > Old score: 0.000, new score: 0.100 > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105282) Completed. > Waiting: 0, Running: 0. 
Heap size: 64M, Heap free: 30M, Max heap: 256M > multiplyScore(ec2_basecluster:0.100(1.060):1/1 overload: 0, -0.2) > Old score: 0.100, new score: -0.100 > Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105285) setting status > to Submitting > Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105285) setting status > to Submitted > Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105285) setting status > to Active > > > > > > > > > Progress: Initializing site shared directory:1 > Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105285) setting status > to Failed null > multiplyScore(ec2_basecluster:-0.100(0.943):1/1 overload: 0, -0.5) > Old score: -0.100, new score: -0.600 > Releasing contact 2 > commitDelayedScore(ec2_basecluster:-0.600(0.705):0/1 overload: 0, 0.1 > Sorted: [ec2_basecluster:-0.500(0.747):0/1 overload: 0] > Rand: 0.4103224563240889, sum: 1.0 > Next contact: ec2_basecluster:-0.500(0.747):0/1 overload: 0 > Progress: Initializing site shared directory:1 > START host=ec2_basecluster - Initializing shared directory > multiplyScore(ec2_basecluster:-0.500(0.747):1/1 overload: -140, -0.01) > Old score: -0.500, new score: -0.510 > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) setting status > to Submitting > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) setting status > to Submitted > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) setting status > to Active > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) setting status > to Completed > multiplyScore(ec2_basecluster:-0.510(0.742):1/1 overload: 0, 0.01) > Old score: -0.510, new score: -0.500 > multiplyScore(ec2_basecluster:-0.500(0.747):1/1 overload: 0, 0.1) > Old score: -0.500, new score: -0.400 > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105288) Completed. > Waiting: 0, Running: 0. 
Heap size: 64M, Heap free: 28M, Max heap: 256M > multiplyScore(ec2_basecluster:-0.400(0.791):1/1 overload: 0, -0.2) > Old score: -0.400, new score: -0.600 > Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105291) setting status > to Submitting > Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105291) setting status > to Submitted > Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105291) setting status > to Active > > > Progress: Initializing site shared directory:1 > Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105291) setting status > to Failed null > multiplyScore(ec2_basecluster:-0.600(0.705):1/1 overload: 0, -0.5) > Old score: -0.600, new score: -1.100 > Releasing contact 3 > commitDelayedScore(ec2_basecluster:-1.100(0.530):0/1 overload: 0, 0.1 > Sorted: [ec2_basecluster:-1.000(0.561):0/1 overload: 0] > Rand: 0.653323366777857, sum: 1.0 > Next contact: ec2_basecluster:-1.000(0.561):0/1 overload: 0 > Progress: Initializing site shared directory:1 > START host=ec2_basecluster - Initializing shared directory > multiplyScore(ec2_basecluster:-1.000(0.561):1/1 overload: -199, -0.01) > Old score: -1.000, new score: -1.010 > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) setting status > to Submitting > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) setting status > to Submitted > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) setting status > to Active > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) setting status > to Completed > multiplyScore(ec2_basecluster:-1.010(0.557):1/1 overload: 0, 0.01) > Old score: -1.010, new score: -1.000 > multiplyScore(ec2_basecluster:-1.000(0.561):1/1 overload: 0, 0.1) > Old score: -1.000, new score: -0.900 > Task(type=FILE_OPERATION, identity=urn:0-1-1246438105294) Completed. > Waiting: 0, Running: 0. Heap size: 64M, Heap free: 27M, Max heap: 256M > multiplyScore(ec2_basecluster:-0.900(0.593):1/1 overload: 0, -0.2) > Old score: -0.900, new score: -1.100 > Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105297) setting status > to Submitting > Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105297) setting status > to Submitted > Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105297) setting status > to Active > Progress: Initializing site shared directory:1 > Task(type=FILE_TRANSFER, identity=urn:0-1-1246438105297) setting status > to Failed null > multiplyScore(ec2_basecluster:-1.100(0.530):1/1 overload: 0, -0.5) > Old score: -1.100, new score: -1.600 > Releasing contact 4 > commitDelayedScore(ec2_basecluster:-1.600(0.403):0/1 overload: 0, 0.1 > END_FAILURE thread=0 tr=echo > Progress: Failed:1 > Could not initialize shared directory on ec2_basecluster > Could not initialize shared directory on ec2_basecluster > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.file.IrrecoverableResourceException > Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. 
> (error code 4) > at > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) > > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.file.IrrecoverableResourceException > Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. 
> (error code 4) > at > org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) > > at > org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:151) > > at > org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.taskFailed(AbstractGridNode.java:314) > > at > org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) > > at > org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) > > at > org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:656) > > at > org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:421) > > at > org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) > > at > org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236) > > at > org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224) > > at > org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) > > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.transferFailed(DelegatedFileTransferHandler.java:581) > > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:505) > > at java.lang.Thread.run(Thread.java:595) > Caused by: > org.globus.cog.abstraction.impl.file.IrrecoverableResourceException > at > org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:44) > > at > org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:33) > > at > org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:430) > > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:355) > > at > org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47) > > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:492) > > ... 1 more > Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. > (error code 4) > at > org.globus.ftp.vanilla.FTPServerFacade$LocalControlChannel.waitFor(FTPServerFacade.java:511) > > at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:129) > ... 1 more > Execution failed: > Could not initialize shared directory on ec2_basecluster > Caused by: > Reply wait timeout. (error code 4) > Detailed exception: > Could not initialize shared directory on ec2_basecluster > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.file.IrrecoverableResourceException > Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. 
> (error code 4) > at > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) > > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.file.IrrecoverableResourceException > Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. 
> (error code 4) > at > org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) > > at > org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:151) > > at > org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.taskFailed(AbstractGridNode.java:314) > > at > org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) > > at > org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) > > at > org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:656) > > at > org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:421) > > at > org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) > > at > org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236) > > at > org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224) > > at > org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) > > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.transferFailed(DelegatedFileTransferHandler.java:581) > > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:505) > > at java.lang.Thread.run(Thread.java:595) > Caused by: > org.globus.cog.abstraction.impl.file.IrrecoverableResourceException > at > org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:44) > > at > org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTPFileResource.java:33) > > at > org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:430) > > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:355) > > at > org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47) > > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:492) > > ... 1 more > Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. > (error code 4) > at > org.globus.ftp.vanilla.FTPServerFacade$LocalControlChannel.waitFor(FTPServerFacade.java:511) > > at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:129) > ... 1 more > Swift finished with errors > -bash-3.2$ > -bash-3.2$ > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Wed Jul 1 07:26:11 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 1 Jul 2009 12:26:11 +0000 (GMT) Subject: [Swift-devel] swift error ( gridftp problem) In-Reply-To: <4A4B2D86.6050608@cs.uchicago.edu> References: <4A4B2D86.6050608@cs.uchicago.edu> Message-ID: this is almost definitely a firewall problem, with you not having the correct ports for gridftp data channels open. 
read this:

http://dev.globus.org/wiki/FirewallHowTo

You need to configure an ephemeral port range in your firewall, of maybe
1000 ports, and declare it in the GLOBUS_TCP_PORT_RANGE for your server,
as described here:

http://dev.globus.org/wiki/FirewallHowTo#Configuring_GridFTP_to_use_GLOBUS_TCP_PORT_RANGE

Make sure you can transfer a file with globus-url-copy before attempting
to run Swift.

This is not a swift-specific problem.

--

From benc at hawaga.org.uk  Wed Jul  1 10:37:28 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 1 Jul 2009 15:37:28 +0000 (GMT)
Subject: [Swift-devel] writeData
Message-ID:

r2994 contains a writeData function which does the opposite of readData.
Specifically, you can say: file l; l = writeData(@f); to output the
filenames for a data structure into a text file, so that you can pass this
instead of passing filenames on the command line.

--

From yizhu at cs.uchicago.edu  Wed Jul  1 15:42:31 2009
From: yizhu at cs.uchicago.edu (yizhu)
Date: Wed, 01 Jul 2009 15:42:31 -0500
Subject: [Swift-devel] swift error ( gridftp problem)
In-Reply-To:
References: <4A4B2D86.6050608@cs.uchicago.edu>
Message-ID: <4A4BCA37.8070104@cs.uchicago.edu>

Yup, it's my firewall setting problem; it works now. Thanks.

-Yi

Ben Clifford wrote:
> this is almost definitely a firewall problem, with you not having the
> correct ports for gridftp data channels open.
>
> read this:
>
> http://dev.globus.org/wiki/FirewallHowTo
>
> You need to configure an ephemeral port range in your firewall, of maybe
> 1000 ports, and declare it in the GLOBUS_TCP_PORT_RANGE for your server,
> as described here:
>
> http://dev.globus.org/wiki/FirewallHowTo#Configuring_GridFTP_to_use_GLOBUS_TCP_PORT_RANGE
>
> Make sure you can transfer a file with globus-url-copy before attempting
> to run Swift.
>
> This is not a swift-specific problem.
>

From rynge at renci.org  Thu Jul  2 11:19:02 2009
From: rynge at renci.org (Mats Rynge)
Date: Thu, 02 Jul 2009 12:19:02 -0400
Subject: [Swift-devel] Patch for swift-osg-ress-site-catalog
Message-ID: <4A4CDDF6.9030101@renci.org>

Swift developers,

Attached is a patch for the swift-osg-ress-site-catalog tool, with a fix
for sites having multiple gatekeepers advertised under the same site name.

--
Mats Rynge
Renaissance Computing Institute
-------------- next part --------------
A non-text attachment was scrubbed...
Name: swift-osg-ress-site-catalog.patch
Type: text/x-diff
Size: 1804 bytes
Desc: not available
URL:

From aespinosa at cs.uchicago.edu  Thu Jul  2 14:32:30 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 2 Jul 2009 14:32:30 -0500
Subject: [Swift-devel] workers not initiated on all nodes/cpus in a block
Message-ID: <50b07b4b0907021232w42a186e3yea94e4432e154506@mail.gmail.com>

Looking at the submit script from before: even though the coaster block
requested 8 nodes, it still runs only one worker.

Submit script found:

cat PBS2252235058660926788.submit
#PBS -S /bin/sh
#PBS -N null
#PBS -m n
#PBS -l nodes=8
#PBS -l walltime=00:04:00
#PBS -q short
#PBS -o /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stdout
#PBS -e /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stderr
/usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000004 1
/bin/echo $? >/home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.exitcode
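The perl line in this script starts only a single coaster worker, on whichever node PBS runs the batch script; the suggestion that follows is to fan that launch out to every allocated node. A rough sketch of what that change might look like, assuming Torque's pbsdsh is available on the cluster (the script path and service URL are simply copied from the listing above; this illustrates the idea rather than the actual provider fix):

# launch one coaster worker per allocated node instead of one worker total
pbsdsh -u /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000004 1
/bin/echo $? >/home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.exitcode

pbsdsh -u starts one task per unique host listed in $PBS_NODEFILE; dropping -u would start one task per allocated processor slot.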
i think this is the reason why in some instances the block requests more nodes but not all are active. host information: [aespinosa at communicado ~]$ screen -r IWD: [NONE] Executable: [NONE] Bypass: 0 StartCount: 1 PartitionMask: [ALL] Flags: RESTARTABLE Reservation '1122120' (-00:05:07 -> 00:22:53 Duration: 00:28:00) PE: 8.00 StartPriority: 1800 [aespinosa at tp-c105 scripts]$ ssh tp-c114 ps x Password: PID TTY STAT TIME COMMAND 31815 ? Ss 0:00 -sh 32054 ? S 0:00 pbs_demux 32229 ? S 0:00 -sh 32230 ? S 0:00 /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000003 1 32231 ? S 0:00 /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000003 1 32233 ? S 0:00 /bin/bash /home/aespinosa/work/ampl/ampl-teraport_coaster/shared/_swiftwrap run_ampl-slru44dj -jobdir s -e /home/zzhang/SEE/static/run_ampl -out result/run1416/stdout -err stderr.txt -i -d |subproblems|result/run1416 -if template|armington.mod|armington_process.cmd|armington_output.cmd|subproblems/producer_tree.mod|ces.so -of result/run1416/expend.dat|result/run1416/limits.dat|result/run1416/price.dat|result/run1416/ratio.dat|result/run1416/solve.dat|result/run1416/stdout -k -status files -a run1416 template armington.mod armington_process.cmd armington_output.cmd subproblems/producer_tree.mod ces.so 32256 ? S 0:00 /bin/bash /home/zzhang/SEE/static/run_ampl run1416 template armington.mod armington_process.cmd armington_output.cmd subproblems/producer_tree.mod ces.so 32258 ? S 0:19 ampl arm_test.cmd 32716 ? R 0:37 pathampl /tmp/at32258 -AMPL 32726 ? S 0:00 sshd: aespinosa at notty 32727 ? Rs 0:00 ps x [aespinosa at tp-c105 scripts]$ ssh tp-c105 ps x Password: PID TTY STAT TIME COMMAND 30721 ? S 0:00 sshd: aespinosa at pts/0 30722 pts/0 Ss 0:00 -bash 30951 pts/0 S+ 0:00 ssh tp-c105 ps x 30955 ? S 0:00 sshd: aespinosa at notty 30956 ? Rs 0:00 ps x [aespinosa at tp-c105 scripts]$ ssh tp-c102 ps x The authenticity of host 'tp-c102 (10.135.125.108)' can't be established. RSA key fingerprint is 60:dc:28:eb:f3:1b:ca:80:48:f2:32:f5:1e:3b:b3:d7. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'tp-c102,10.135.125.108' (RSA) to the list of known hosts. Password: PID TTY STAT TIME COMMAND 10274 ? S 0:00 sshd: aespinosa at notty 10275 ? Rs 0:00 ps x ... ... swift session snapshot: Progress: Selecting site:1014 Submitted:8 Active:1 Progress: Selecting site:1014 Submitted:8 Active:1 Progress: Selecting site:1014 Submitted:8 Active:1 Progress: Selecting site:1014 Submitted:8 Active:1 queue information: ACTIVE JOBS-------------------- JOBNAME USERNAME STATE PROC REMAINING STARTTIME 1122120 aespinosa Running 8 00:19:53 Thu Jul 2 14:22:19 1 Active Job 171 of 200 Processors Active (85.50%) 100 of 100 Nodes Active (100.00%) -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Thu Jul 2 14:39:03 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 02 Jul 2009 14:39:03 -0500 Subject: [Swift-devel] workers not initiated on all nodes/cpus in a block In-Reply-To: <50b07b4b0907021232w42a186e3yea94e4432e154506@mail.gmail.com> References: <50b07b4b0907021232w42a186e3yea94e4432e154506@mail.gmail.com> Message-ID: <1246563543.4778.0.camel@localhost> This is with the PBS provider rather than Globus, right? 
On Thu, 2009-07-02 at 14:32 -0500, Allan Espinosa wrote: > looking at the submit script before, even though the coaster block > requested for 8 nodes, it still simply runs 1 worker > > submit script found: > cat PBS2252235058660926788.submit > #PBS -S /bin/sh > #PBS -N null > #PBS -m n > #PBS -l nodes=8 > #PBS -l walltime=00:04:00 > #PBS -q short > #PBS -o /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stdout > #PBS -e /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stderr > /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl > http://128.135.125.116:47679 0702-050234-000004 1 > /bin/echo $? >/home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.exitcode > > > the /usr/bin/perl line should be prepended with "pbdsh" or other > equivalent utilities to execute the script on all nodes/cpus. i think > this is the reason why in some instances the block requests more nodes > but not all are active. > > host information: > [aespinosa at communicado ~]$ screen -r > IWD: [NONE] Executable: [NONE] > Bypass: 0 StartCount: 1 > PartitionMask: [ALL] > Flags: RESTARTABLE > > Reservation '1122120' (-00:05:07 -> 00:22:53 Duration: 00:28:00) > PE: 8.00 StartPriority: 1800 > > [aespinosa at tp-c105 scripts]$ ssh tp-c114 ps x > Password: > PID TTY STAT TIME COMMAND > 31815 ? Ss 0:00 -sh > 32054 ? S 0:00 pbs_demux > 32229 ? S 0:00 -sh > 32230 ? S 0:00 /usr/bin/perl > /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl > http://128.135.125.116:47679 0702-050234-000003 1 > 32231 ? S 0:00 /usr/bin/perl > /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl > http://128.135.125.116:47679 0702-050234-000003 1 > 32233 ? S 0:00 /bin/bash > /home/aespinosa/work/ampl/ampl-teraport_coaster/shared/_swiftwrap > run_ampl-slru44dj -jobdir s -e /home/zzhang/SEE/static/run_ampl -out > result/run1416/stdout -err stderr.txt -i -d > |subproblems|result/run1416 -if > template|armington.mod|armington_process.cmd|armington_output.cmd|subproblems/producer_tree.mod|ces.so > -of result/run1416/expend.dat|result/run1416/limits.dat|result/run1416/price.dat|result/run1416/ratio.dat|result/run1416/solve.dat|result/run1416/stdout > -k -status files -a run1416 template armington.mod > armington_process.cmd armington_output.cmd > subproblems/producer_tree.mod ces.so > 32256 ? S 0:00 /bin/bash /home/zzhang/SEE/static/run_ampl > run1416 template armington.mod armington_process.cmd > armington_output.cmd subproblems/producer_tree.mod ces.so > 32258 ? S 0:19 ampl arm_test.cmd > 32716 ? R 0:37 pathampl /tmp/at32258 -AMPL > 32726 ? S 0:00 sshd: aespinosa at notty > 32727 ? Rs 0:00 ps x > [aespinosa at tp-c105 scripts]$ ssh tp-c105 ps x > Password: > PID TTY STAT TIME COMMAND > 30721 ? S 0:00 sshd: aespinosa at pts/0 > 30722 pts/0 Ss 0:00 -bash > 30951 pts/0 S+ 0:00 ssh tp-c105 ps x > 30955 ? S 0:00 sshd: aespinosa at notty > 30956 ? Rs 0:00 ps x > [aespinosa at tp-c105 scripts]$ ssh tp-c102 ps x > The authenticity of host 'tp-c102 (10.135.125.108)' can't be established. > RSA key fingerprint is 60:dc:28:eb:f3:1b:ca:80:48:f2:32:f5:1e:3b:b3:d7. > Are you sure you want to continue connecting (yes/no)? yes > Warning: Permanently added 'tp-c102,10.135.125.108' (RSA) to the list > of known hosts. > Password: > PID TTY STAT TIME COMMAND > 10274 ? S 0:00 sshd: aespinosa at notty > 10275 ? Rs 0:00 ps x > ... > ... 
> > > swift session snapshot: > Progress: Selecting site:1014 Submitted:8 Active:1 > Progress: Selecting site:1014 Submitted:8 Active:1 > Progress: Selecting site:1014 Submitted:8 Active:1 > Progress: Selecting site:1014 Submitted:8 Active:1 > > queue information: > ACTIVE JOBS-------------------- > JOBNAME USERNAME STATE PROC REMAINING STARTTIME > > 1122120 aespinosa Running 8 00:19:53 Thu Jul 2 14:22:19 > > 1 Active Job 171 of 200 Processors Active (85.50%) > 100 of 100 Nodes Active (100.00%) > > > > > > -- > Allan M. Espinosa > PhD student, Computer Science > University of Chicago > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Thu Jul 2 14:42:25 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 2 Jul 2009 14:42:25 -0500 Subject: [Swift-devel] workers not initiated on all nodes/cpus in a block In-Reply-To: <1246563543.4778.0.camel@localhost> References: <50b07b4b0907021232w42a186e3yea94e4432e154506@mail.gmail.com> <1246563543.4778.0.camel@localhost> Message-ID: <50b07b4b0907021242y609c8a5wc09a8707f9668f9@mail.gmail.com> yup pbs provider. i'll checkout if the same goes with the globus gt2 provider. -Allan 2009/7/2 Mihael Hategan : > This is with the PBS provider rather than Globus, right? > > On Thu, 2009-07-02 at 14:32 -0500, Allan Espinosa wrote: >> looking at the submit script before, even though the coaster block >> requested for 8 nodes, it still simply runs 1 worker >> >> submit script found: >> ?cat PBS2252235058660926788.submit >> #PBS -S /bin/sh >> #PBS -N null >> #PBS -m n >> #PBS -l nodes=8 >> #PBS -l walltime=00:04:00 >> #PBS -q short >> #PBS -o /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stdout >> #PBS -e /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stderr >> /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl >> http://128.135.125.116:47679 0702-050234-000004 1 >> /bin/echo $? >/home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.exitcode >> >> >> the /usr/bin/perl line should be prepended with "pbdsh" or other >> equivalent utilities to execute the script on all nodes/cpus. i think >> this is the reason why in some instances the block requests more nodes >> but not all are active. >> >> host information: >> [aespinosa at communicado ~]$ screen -r >> IWD: [NONE] ?Executable: ?[NONE] >> Bypass: 0 ?StartCount: 1 >> PartitionMask: [ALL] >> Flags: ? ? ? RESTARTABLE >> >> Reservation '1122120' (-00:05:07 -> 00:22:53 ?Duration: 00:28:00) >> PE: ?8.00 ?StartPriority: ?1800 >> >> [aespinosa at tp-c105 scripts]$ ssh tp-c114 ps x >> Password: >> ? PID TTY ? ? ?STAT ? TIME COMMAND >> 31815 ? ? ? ? ?Ss ? ? 0:00 -sh >> 32054 ? ? ? ? ?S ? ? ?0:00 pbs_demux >> 32229 ? ? ? ? ?S ? ? ?0:00 -sh >> 32230 ? ? ? ? ?S ? ? ?0:00 /usr/bin/perl >> /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl >> http://128.135.125.116:47679 0702-050234-000003 1 >> 32231 ? ? ? ? ?S ? ? ?0:00 /usr/bin/perl >> /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl >> http://128.135.125.116:47679 0702-050234-000003 1 >> 32233 ? ? ? ? ?S ? ? 
0:00 /bin/bash >> /home/aespinosa/work/ampl/ampl-teraport_coaster/shared/_swiftwrap >> run_ampl-slru44dj -jobdir s -e /home/zzhang/SEE/static/run_ampl -out >> result/run1416/stdout -err stderr.txt -i -d >> |subproblems|result/run1416 -if >> template|armington.mod|armington_process.cmd|armington_output.cmd|subproblems/producer_tree.mod|ces.so >> -of result/run1416/expend.dat|result/run1416/limits.dat|result/run1416/price.dat|result/run1416/ratio.dat|result/run1416/solve.dat|result/run1416/stdout >> -k -status files -a run1416 template armington.mod >> armington_process.cmd armington_output.cmd >> subproblems/producer_tree.mod ces.so >> 32256 ? S 0:00 /bin/bash /home/zzhang/SEE/static/run_ampl >> run1416 template armington.mod armington_process.cmd >> armington_output.cmd subproblems/producer_tree.mod ces.so >> 32258 ? S 0:19 ampl arm_test.cmd >> 32716 ? R 0:37 pathampl /tmp/at32258 -AMPL >> 32726 ? S 0:00 sshd: aespinosa at notty >> 32727 ? Rs 0:00 ps x >> [aespinosa at tp-c105 scripts]$ ssh tp-c105 ps x >> Password: >> PID TTY STAT TIME COMMAND >> 30721 ? S 0:00 sshd: aespinosa at pts/0 >> 30722 pts/0 Ss 0:00 -bash >> 30951 pts/0 S+ 0:00 ssh tp-c105 ps x >> 30955 ? S 0:00 sshd: aespinosa at notty >> 30956 ? Rs 0:00 ps x >> [aespinosa at tp-c105 scripts]$ ssh tp-c102 ps x >> The authenticity of host 'tp-c102 (10.135.125.108)' can't be established. >> RSA key fingerprint is 60:dc:28:eb:f3:1b:ca:80:48:f2:32:f5:1e:3b:b3:d7. >> Are you sure you want to continue connecting (yes/no)? yes >> Warning: Permanently added 'tp-c102,10.135.125.108' (RSA) to the list >> of known hosts. >> Password: >> PID TTY STAT TIME COMMAND >> 10274 ? S 0:00 sshd: aespinosa at notty >> 10275 ? Rs 0:00 ps x >> ... >> ... >> >> >> swift session snapshot: >> Progress: Selecting site:1014 Submitted:8 Active:1 >> Progress: Selecting site:1014 Submitted:8 Active:1 >> Progress: Selecting site:1014 Submitted:8 Active:1 >> Progress: Selecting site:1014 Submitted:8 Active:1 >> >> queue information: >> ACTIVE JOBS-------------------- >> JOBNAME USERNAME STATE PROC REMAINING STARTTIME >> >> 1122120 aespinosa Running 8 00:19:53 Thu Jul 2 14:22:19 >> >> 1 Active Job 171 of 200 Processors Active (85.50%) >> 100 of 100 Nodes Active (100.00%) From benc at hawaga.org.uk Fri Jul 3 04:28:01 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 3 Jul 2009 09:28:01 +0000 (GMT) Subject: [Swift-devel] Patch for swift-osg-ress-site-catalog In-Reply-To: <4A4CDDF6.9030101@renci.org> References: <4A4CDDF6.9030101@renci.org> Message-ID: applied r2995 -- From benc at hawaga.org.uk Fri Jul 3 12:35:41 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 3 Jul 2009 17:35:41 +0000 (GMT) Subject: [Swift-devel] imports Message-ID: swift r2996 contains an import directive which will import SwiftScript code from other .swift files into the current program. This is done deep in the compiler, and is not a preprocessor. You can import the same file multiple times without trouble; it will only be processed once. At present you can only import files that are in the current working directory. $PATH/$CLASSPATH/$PERL5LIB style path handling should be straightforward to implement, though.
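As a concrete illustration of the directive (a sketch only; the exact surface syntax is assumed here rather than checked against r2996), suppose defs.swift sits in the working directory:

// defs.swift
type messagefile;

(messagefile t) write(string s) {
    app { echo s stdout=@filename(t); }
}

// main.swift -- pulls in the definitions above at compile time
import defs;

messagefile out <"out.txt">;
out = write("imported ok");

Because the import is resolved inside the compiler rather than by textual inclusion, importing defs both directly and via some other imported file should still define write only once.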
-- From bugzilla-daemon at mcs.anl.gov Wed Jul 8 09:37:39 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 8 Jul 2009 09:37:39 -0500 (CDT) Subject: [Swift-devel] [Bug 214] New: Enhance logging and debug capabilities for Condor provider Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=214 Summary: Enhance logging and debug capabilities for Condor provider Product: Swift Version: unspecified Platform: All OS/Version: Linux Status: NEW Severity: enhancement Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: wilde at mcs.anl.gov - create all condor submit files with a log file entry - add a setting to not delete condor files in .globus/scripts after they complete, for debugging -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From wilde at mcs.anl.gov Wed Jul 8 09:38:23 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 08 Jul 2009 09:38:23 -0500 Subject: [Swift-devel] Re: [CI Ticketing System #1226] Condor hung on communicado In-Reply-To: References: <4A4E05A7.8030503@mcs.anl.gov> <4A54A54D.1010006@mcs.anl.gov> Message-ID: <4A54AF5F.5050200@mcs.anl.gov> done. On 7/8/09 9:20 AM, Ben Clifford wrote: > On Wed, 8 Jul 2009, Michael Wilde wrote: > >> - create all condor submit files with a log file entry >> - a setting to not delete condor files in .globus/scripts after they complete, >> for debugging > > those would be best entered as enhancement requests into the CoG bugzilla. > From rynge at renci.org Fri Jul 10 16:29:31 2009 From: rynge at renci.org (Mats Rynge) Date: Fri, 10 Jul 2009 17:29:31 -0400 Subject: [Swift-devel] Condor-G jobs left in the queue upon completion/hold Message-ID: <4A57B2BB.8030904@renci.org> Looks like Swift is not cleaning up completed/held Condor-G jobs. There are more than 1000 jobs in the queue on engage-submit. Some jobs are in the Done state, some in the Held state. -- Mats Rynge Renaissance Computing Institute From aespinosa at cs.uchicago.edu Fri Jul 10 16:39:48 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Fri, 10 Jul 2009 16:39:48 -0500 Subject: [Swift-devel] Re: Condor-G jobs left in the queue upon completion/hold In-Reply-To: <4A57B2BB.8030904@renci.org> References: <4A57B2BB.8030904@renci.org> Message-ID: <50b07b4b0907101439o3054aaf0w78f9f420d1bd49c9@mail.gmail.com> Hi Mats, Just cleaned my jobs on the queue. I did not realize this when my jobs finished last night -Allan 2009/7/10 Mats Rynge : > Looks like Swift is not cleaning up completed/held Condor-G jobs. There are > more than 1000 jobs in the queue on engage-submit. Some jobs are in the Done > state, some in the Held state. From hategan at mcs.anl.gov Fri Jul 10 16:53:08 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 10 Jul 2009 16:53:08 -0500 Subject: [Swift-devel] Condor-G jobs left in the queue upon completion/hold In-Reply-To: <4A57B2BB.8030904@renci.org> References: <4A57B2BB.8030904@renci.org> Message-ID: <1247262788.15261.9.camel@localhost> Yeah. I think job logs might be a better solution to figure job state than +leave_in_queue, check, -leave_in_queue. On Fri, 2009-07-10 at 17:29 -0400, Mats Rynge wrote: > Looks like Swift is not cleaning up completed/held Condor-G jobs. There > are more than 1000 jobs in the queue on engage-submit. Some jobs are in > the Done state, some in the Held state. 
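For readers not familiar with the mechanism being weighed here: a Condor submit description can name a user log file, and Condor appends submit, execute, and terminate events to it, so a submitter like Swift could watch that log for state changes instead of holding finished jobs in the queue and polling them. A minimal illustrative Condor-G submit file; the gatekeeper name and paths are placeholders, not taken from the engage-submit setup:

# illustrative only -- host name and paths are placeholders
universe      = grid
grid_resource = gt2 gatekeeper.example.org/jobmanager-pbs
executable    = /bin/sleep
arguments     = 300
output        = sleep.$(Cluster).$(Process).out
error         = sleep.$(Cluster).$(Process).err
log           = /home/someuser/swift-run/condor.log
notification  = never
queue

Whether each job gets its own log or all jobs in a run share one is exactly the tradeoff raised in the follow-up below: many files to tail at once versus relying on atomic, non-interleaved writes to a single file.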
> From benc at hawaga.org.uk Sat Jul 11 05:24:19 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 11 Jul 2009 10:24:19 +0000 (GMT) Subject: [Swift-devel] Condor-G jobs left in the queue upon completion/hold In-Reply-To: <1247262788.15261.9.camel@localhost> References: <4A57B2BB.8030904@renci.org> <1247262788.15261.9.camel@localhost> Message-ID: On Fri, 10 Jul 2009, Mihael Hategan wrote: > Yeah. I think job logs might be a better solution to figure job state > than +leave_in_queue, check, -leave_in_queue. Interestingly, Miron said the same to me only yesterday... Alain Roy is also watching this and agrees. -- From hategan at mcs.anl.gov Sat Jul 11 10:28:37 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 11 Jul 2009 10:28:37 -0500 Subject: [Swift-devel] Condor-G jobs left in the queue upon completion/hold In-Reply-To: References: <4A57B2BB.8030904@renci.org> <1247262788.15261.9.camel@localhost> Message-ID: <1247326117.22686.1.camel@localhost> On Sat, 2009-07-11 at 10:24 +0000, Ben Clifford wrote: > > On Fri, 10 Jul 2009, Mihael Hategan wrote: > > > Yeah. I think job logs might be a better solution to figure job state > > than +leave_in_queue, check, -leave_in_queue. > > Interestingly, Miron said the same to me only yesterday... > > Alain Roy is also watching this and agrees. > Though that suffers from its own problems: 1. If a different log is used for every job, lots of files may need to be tailed at once. 2. If a single file is used for every job, is there any guarantee that entries in the log are written atomically? From bugzilla-daemon at mcs.anl.gov Sun Jul 12 16:23:41 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sun, 12 Jul 2009 16:23:41 -0500 (CDT) Subject: [Swift-devel] [Bug 215] New: stdout and stderr redirect for SGE jobmanager causing failure on stageouts Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=215 Summary: stdout and stderr redirect for SGE jobmanager causing failure on stageouts Product: Swift Version: unspecified Platform: PC OS/Version: Windows Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: skenny at uchicago.edu CC: zhaozhang at uchicago.edu stdout and stderr have been redirected when the SGE job manager is detected. however, this seems to cause a gram failure: 7/10 00:39:28 JM: sending callback of status 4 (failure code 155) to https://128.135.92.64:50003/1247203143796. when this redirection is commented out of the swift code, workflows are running properly on the ranger TeraGrid site. however, it should be noted that when redirection is not in place a data.* file is created in the user's $HOME for each job run (thus, if you run many thousands of jobs, you will have a file for each one). -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. 
From bugzilla-daemon at mcs.anl.gov Mon Jul 13 02:20:43 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 13 Jul 2009 02:20:43 -0500 (CDT) Subject: [Swift-devel] [Bug 216] New: poor compile error when semicolon missing at end of structure definition Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=216 Summary: poor compile error when semicolon missing at end of structure definition Product: Swift Version: unspecified Platform: PC OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Documentation AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk In the below code, the error message given is unenlightening. It should refer to something closer to the actual error. Perhaps the parser should fail as soon as it sees a token after the } that is not SEMI Removing files from previous runs Running test 07554-ext-mapper-struct at Mon Jul 13 09:18:39 CEST 2009 Could not start execution. Compile error in procedure invocation at line 16: Type messagefile is not defined. type messagefile; type struct { messagefile eerste; messagefile twede; } // MISSING SEMICOLON HERE (messagefile t) write(string s) { app { echo s stdout=@filename(t); } } messagefile outfiles ; outfiles.eerste = write("1st"); outfiles.twede = write("2nd"); -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Jul 13 02:23:22 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 13 Jul 2009 02:23:22 -0500 (CDT) Subject: [Swift-devel] [Bug 216] poor compile error when semicolon missing at end of structure definition In-Reply-To: References: Message-ID: <20090713072322.144352CC5C@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=216 Ben Clifford changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |INVALID --- Comment #1 from Ben Clifford 2009-07-13 02:23:21 --- actually this bug report is incorrect. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From wilde at mcs.anl.gov Mon Jul 13 11:45:59 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Jul 2009 11:45:59 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue Message-ID: <4A5B64C7.4080802@mcs.anl.gov> I thought I wrote an email on this, but cant find it, so I will try to recall what I saw. Sarah tried a test run to re-create the problem of "excessive overhead from coasters on the head node". This was spurred by another complaint from the Ranger sysadmins. The complaint had about the same level of detail as the first: it was voice mail saying "your processing are causing too much overhead on the login node". So we tried to do a test to isolate and quantify what was happening. We did not get far enough, but got some initial observations. Submitting from gwynn.bsd.uchicago.edu (I think) Sarah ran a workflow of 50 sleep 300 jobs (approx). This was around 7PM Thu night Jul 9. Sarah, are these logs still there? Can you copy the coaster and swift logs to the CI where we can look at them? 
What I saw in top (-b -d) and ps was: - two Java processes were created on login3 (headnode) with her ID - one was about 275MB virt mem and burning 100% CPU time, continuously - one was about 1GB virt mem and not burning much time - tailing the coaster log in Sarah's home directory showed repetitive activity, seemingly about every second, a burst of "polling-like" messages - seems like there were about 3-4 GRAM jobmanagers for the 50 jobs, which would be good, I think (in that it seems like jobs were allocated in blocks). At the time we did not have a chance to gather detailed evidence, but I was surprised by two things: - that there were two Java processes and that one was so big. (Are most likely the active process was just a child thread of the main process?) - that there was continual log activity while the 50 jobs were sleeping. But I dont have solid evidence that the 50 jobs were actually running and sleeping. I think if we correlate the swift log and the coaster log here we might learn more. I dont know if this was using Mihael's latest code with a reduced logging level or not. Allan, this seems like it should be straightforward to reproduce now, so please go ahead and try to do that, and capture everything, including ideally the profile info that Mihael was trying to explain to Zhao how to capture. - Mike From hategan at mcs.anl.gov Mon Jul 13 12:04:02 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 12:04:02 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <4A5B64C7.4080802@mcs.anl.gov> References: <4A5B64C7.4080802@mcs.anl.gov> Message-ID: <1247504642.17460.6.camel@localhost> On Mon, 2009-07-13 at 11:45 -0500, Michael Wilde wrote: > I thought I wrote an email on this, but cant find it, so I will try to > recall what I saw. > > Sarah tried a test run to re-create the problem of "excessive overhead > from coasters on the head node". This was spurred by another complaint > from the Ranger sysadmins. The complaint had about the same level of > detail as the first: it was voice mail saying "your processing are > causing too much overhead on the login node". > > So we tried to do a test to isolate and quantify what was happening. We > did not get far enough, but got some initial observations. > > Submitting from gwynn.bsd.uchicago.edu (I think) Sarah ran a workflow of > 50 sleep 300 jobs (approx). > > This was around 7PM Thu night Jul 9. Sarah, are these logs still there? > Can you copy the coaster and swift logs to the CI where we can look at them? > > What I saw in top (-b -d) and ps was: > > - two Java processes were created on login3 (headnode) with her ID > - one was about 275MB virt mem and burning 100% CPU time, continuously > - one was about 1GB virt mem and not burning much time > - tailing the coaster log in Sarah's home directory showed repetitive > activity, seemingly about every second, a burst of "polling-like" messages > - seems like there were about 3-4 GRAM jobmanagers for the 50 jobs, > which would be good, I think (in that it seems like jobs were allocated > in blocks). > > At the time we did not have a chance to gather detailed evidence, but I > was surprised by two things: > > - that there were two Java processes and that one was so big. (Are most > likely the active process was just a child thread of the main process?) One java process is the bootstrap process (it downloads the coaster jars, sets up the environment and runs the coaster service). It has always been like this. 
Did you happen to capture the output of ps to a file? That would be useful, because from what you are suggesting, it appears that the bootstrap process is eating 100% CPU. That process should only be sleeping after the service is started. > > - that there was continual log activity By some very odd definition of "continual". The schedule is re-computed periodically. The messages also tell you how much time it takes to re-compute the schedule, which divided by the pause interval should give you the maximum CPU usage for the process for a time period, other things ignored. In the idle state, this takes around 1ms (0.1% CPU usage). > while the 50 jobs were sleeping. > But I dont have solid evidence that the 50 jobs were actually running > and sleeping. > > I think if we correlate the swift log and the coaster log here we might > learn more. > > I dont know if this was using Mihael's latest code with a reduced > logging level or not. > > Allan, this seems like it should be straightforward to reproduce now, so > please go ahead and try to do that, and capture everything, including > ideally the profile info that Mihael was trying to explain to Zhao how > to capture. > > - Mike > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Mon Jul 13 12:28:54 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Jul 2009 12:28:54 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <1247504642.17460.6.camel@localhost> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> Message-ID: <4A5B6ED6.60508@mcs.anl.gov> On 7/13/09 12:04 PM, Mihael Hategan wrote: > On Mon, 2009-07-13 at 11:45 -0500, Michael Wilde wrote: >> I thought I wrote an email on this, but cant find it, so I will try to >> recall what I saw. >> >> Sarah tried a test run to re-create the problem of "excessive overhead >> from coasters on the head node". This was spurred by another complaint >> from the Ranger sysadmins. The complaint had about the same level of >> detail as the first: it was voice mail saying "your processing are >> causing too much overhead on the login node". >> >> So we tried to do a test to isolate and quantify what was happening. We >> did not get far enough, but got some initial observations. >> >> Submitting from gwynn.bsd.uchicago.edu (I think) Sarah ran a workflow of >> 50 sleep 300 jobs (approx). >> >> This was around 7PM Thu night Jul 9. Sarah, are these logs still there? >> Can you copy the coaster and swift logs to the CI where we can look at them? >> >> What I saw in top (-b -d) and ps was: >> >> - two Java processes were created on login3 (headnode) with her ID >> - one was about 275MB virt mem and burning 100% CPU time, continuously >> - one was about 1GB virt mem and not burning much time >> - tailing the coaster log in Sarah's home directory showed repetitive >> activity, seemingly about every second, a burst of "polling-like" messages >> - seems like there were about 3-4 GRAM jobmanagers for the 50 jobs, >> which would be good, I think (in that it seems like jobs were allocated >> in blocks). >> >> At the time we did not have a chance to gather detailed evidence, but I >> was surprised by two things: >> >> - that there were two Java processes and that one was so big. (Are most >> likely the active process was just a child thread of the main process?) 
> > One java process is the bootstrap process (it downloads the coaster > jars, sets up the environment and runs the coaster service). It has > always been like this. Did you happen to capture the output of ps to a > file? That would be useful, because from what you are suggesting, it > appears that the bootstrap process is eating 100% CPU. That process > should only be sleeping after the service is started. I *thought* I captured the output of "top -u sarahs'id -b -d" but I cant locate it. As best as I can recall it showed the larger memory-footprint process to be relatively idle, and the smaller footprint process (about 275MB) to be burning 100% of a CPU. Allan will try to get a snapshot of this shortly. If this observation if correct, whats the best way to find out where its spinning? Profiling? Debug logging? Can you get profiling data from a JVM that doesnt exit? - Mike > >> - that there was continual log activity > > By some very odd definition of "continual". The schedule is re-computed > periodically. The messages also tell you how much time it takes to > re-compute the schedule, which divided by the pause interval should give > you the maximum CPU usage for the process for a time period, other > things ignored. In the idle state, this takes around 1ms (0.1% CPU > usage). > >> while the 50 jobs were sleeping. >> But I dont have solid evidence that the 50 jobs were actually running >> and sleeping. >> >> I think if we correlate the swift log and the coaster log here we might >> learn more. >> >> I dont know if this was using Mihael's latest code with a reduced >> logging level or not. >> >> Allan, this seems like it should be straightforward to reproduce now, so >> please go ahead and try to do that, and capture everything, including >> ideally the profile info that Mihael was trying to explain to Zhao how >> to capture. >> >> - Mike >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Mon Jul 13 13:23:15 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 13:23:15 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <4A5B6ED6.60508@mcs.anl.gov> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> Message-ID: <1247509395.20144.4.camel@localhost> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: > >> > >> At the time we did not have a chance to gather detailed evidence, but I > >> was surprised by two things: > >> > >> - that there were two Java processes and that one was so big. (Are most > >> likely the active process was just a child thread of the main process?) > > > > One java process is the bootstrap process (it downloads the coaster > > jars, sets up the environment and runs the coaster service). It has > > always been like this. Did you happen to capture the output of ps to a > > file? That would be useful, because from what you are suggesting, it > > appears that the bootstrap process is eating 100% CPU. That process > > should only be sleeping after the service is started. > > I *thought* I captured the output of "top -u sarahs'id -b -d" but I cant > locate it. > > As best as I can recall it showed the larger memory-footprint process to > be relatively idle, and the smaller footprint process (about 275MB) to > be burning 100% of a CPU. Normally, the smaller footprint process should be the bootstrap. 
But that's why I would like the ps output, because it sounds odd. > Allan will try to get a snapshot of this shortly. > > If this observation if correct, whats the best way to find out where its > spinning? Profiling? Debug logging? Can you get profiling data from a > JVM that doesnt exit? Once I know where it is, I can look at the code and then we'll go from there. From aespinosa at cs.uchicago.edu Mon Jul 13 13:55:18 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 13 Jul 2009 13:55:18 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <1247509395.20144.4.camel@localhost> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> Message-ID: <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From here process 22395 is the child of the main java process (bootstrap.jar) and is loading the CPU. I have coasters.log, worker-*log, swift logs, gram logs in ~aespinosa/workflows/activelog/run06. This refers to a different run. PID 15206 is the child java process of bootstrap.jar in here. top snapshot: top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55 Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 globus-job-mana 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 globus-job-man ps snapshot: 22328 ? S 0:00 \_ /bin/bash 22364 ? Sl 0:00 \_ /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520 https://128.135.125.17:46519 11505253269 22395 ? 
SNl 6:29 \_ /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -Djava.security.egd=file:///dev/urandom -cp /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec946b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar:/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_service-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/c
ache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc 2009/7/13 Mihael Hategan : > On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: >> >> >> >> At the time we did not have a chance to gather detailed evidence, but I >> >> was surprised by two things: >> >> >> >> - that there were two Java processes and that one was so big. (Are most >> >> likely the active process was just a child thread of the main process?) >> > >> > One java process is the bootstrap process (it downloads the coaster >> > jars, sets up the environment and runs the coaster service). It has >> > always been like this. Did you happen to capture the output of ps to a >> > file? That would be useful, because from what you are suggesting, it >> > appears that the bootstrap process is eating 100% CPU. That process >> > should only be sleeping after the service is started. >> >> I *thought* I captured the output of "top -u sarahs'id -b -d" but I cant >> locate it. >> >> As best as I can recall it showed the larger memory-footprint process to >> be relatively idle, and the smaller footprint process (about 275MB) to >> be burning 100% of a CPU. > > Normally, the smaller footprint process should be the bootstrap. But > that's why I would like the ps output, because it sounds odd. > >> ? Allan will try to get a snapshot of this shortly. >> >> If this observation if correct, whats the best way to find out where its >> spinning? Profiling? Debug logging? Can you get profiling data from a >> JVM that doesnt exit? > > Once I know where it is, I can look at the code and then we'll go from > there. > > > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Mon Jul 13 14:06:09 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 14:06:09 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> Message-ID: <1247511969.21171.4.camel@localhost> A while ago I committed a patch to run the service process with a lower priority. Is that in use? Also, is logging reduced or is it the default? Is the 97% CPU usage a spike, or does it stay there on average? Can I take a look at the coaster logs from skenny's run on ranger? I'd also like to point out in as little offensive mode as I can, that I'm working 100% on I2U2 and my lack of getting more than lightly involved in this is a consequence of that. On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: > I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From > here process 22395 is the child of the main java process > (bootstrap.jar) and is loading the CPU. > > I have coasters.log, worker-*log, swift logs, gram logs in > ~aespinosa/workflows/activelog/run06. This refers to a different run. 
> PID 15206 is the child java process of bootstrap.jar in here. > > top snapshot: > top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55 > Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie > Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers > Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java > 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top > 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 globus-job-mana > 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd > 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash > 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash > 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash > 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java > 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 globus-job-man > > ps snapshot: > > 22328 ? S 0:00 \_ /bin/bash > 22364 ? Sl 0:00 \_ > /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java > -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= > -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up > -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar > /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520 > https://128.135.125.17:46519 11505253269 > 22395 ? SNl 6:29 \_ > /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M > -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up > -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu > -Djava.security.egd=file:///dev/urandom -cp > /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec946b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b
401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar:/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_service-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc > > > > 2009/7/13 Mihael Hategan : > > On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: > >> >> > >> >> At the time we did not have a chance to gather detailed evidence, but I > >> >> was surprised by two things: > >> >> > >> >> - that there were two Java processes and that one was so big. (Are most > >> >> likely the active process was just a child thread of the main process?) > >> > > >> > One java process is the bootstrap process (it downloads the coaster > >> > jars, sets up the environment and runs the coaster service). It has > >> > always been like this. Did you happen to capture the output of ps to a > >> > file? That would be useful, because from what you are suggesting, it > >> > appears that the bootstrap process is eating 100% CPU. That process > >> > should only be sleeping after the service is started. > >> > >> I *thought* I captured the output of "top -u sarahs'id -b -d" but I cant > >> locate it. > >> > >> As best as I can recall it showed the larger memory-footprint process to > >> be relatively idle, and the smaller footprint process (about 275MB) to > >> be burning 100% of a CPU. > > > > Normally, the smaller footprint process should be the bootstrap. But > > that's why I would like the ps output, because it sounds odd. > > > >> Allan will try to get a snapshot of this shortly. > >> > >> If this observation if correct, whats the best way to find out where its > >> spinning? Profiling? Debug logging? 
Can you get profiling data from a > >> JVM that doesnt exit? > > > > Once I know where it is, I can look at the code and then we'll go from > > there. > > > > > > > > > From wilde at mcs.anl.gov Mon Jul 13 14:11:34 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Jul 2009 14:11:34 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <1247511969.21171.4.camel@localhost> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> Message-ID: <4A5B86E6.2000803@mcs.anl.gov> On 7/13/09 2:06 PM, Mihael Hategan wrote: > A while ago I committed a patch to run the service process with a lower > priority. Is that in use? Looks like 22395 is running with a nice value of 10 which I think is what you set in that patch: 22395 aespinos 25 10 > > Also, is logging reduced or is it the default? > > Is the 97% CPU usage a spike, or does it stay there on average? > > Can I take a look at the coaster logs from skenny's run on ranger? > > I'd also like to point out in as little offensive mode as I can, that > I'm working 100% on I2U2 and my lack of getting more than lightly > involved in this is a consequence of that. Right, understood. Any pointers you can give are welcome, and Allan and I are expecting to do the legwork. We'll at least try to find out where the overhead is coming from. - Mike > > On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: >> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From >> here process 22395 is the child of the main java process >> (bootstrap.jar) and is loading the CPU. >> >> I have coasters.log, worker-*log, swift logs, gram logs in >> ~aespinosa/workflows/activelog/run06. This refers to a different run. >> PID 15206 is the child java process of bootstrap.jar in here. >> >> top snapshot: >> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55 >> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie >> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st >> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers >> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java >> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top >> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 globus-job-mana >> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd >> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash >> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash >> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash >> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java >> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 globus-job-man >> >> ps snapshot: >> >> 22328 ? S 0:00 \_ /bin/bash >> 22364 ? Sl 0:00 \_ >> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java >> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= >> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar >> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520 >> https://128.135.125.17:46519 11505253269 >> 22395 ? 
SNl 6:29 \_ >> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M >> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu >> -Djava.security.egd=file:///dev/urandom -cp >> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec94 6b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar: /home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_se 
rvice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc >> >> >> >> 2009/7/13 Mihael Hategan : >>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: >>>>>> At the time we did not have a chance to gather detailed evidence, but I >>>>>> was surprised by two things: >>>>>> >>>>>> - that there were two Java processes and that one was so big. (Are most >>>>>> likely the active process was just a child thread of the main process?) >>>>> One java process is the bootstrap process (it downloads the coaster >>>>> jars, sets up the environment and runs the coaster service). It has >>>>> always been like this. Did you happen to capture the output of ps to a >>>>> file? That would be useful, because from what you are suggesting, it >>>>> appears that the bootstrap process is eating 100% CPU. That process >>>>> should only be sleeping after the service is started. >>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I cant >>>> locate it. >>>> >>>> As best as I can recall it showed the larger memory-footprint process to >>>> be relatively idle, and the smaller footprint process (about 275MB) to >>>> be burning 100% of a CPU. >>> Normally, the smaller footprint process should be the bootstrap. But >>> that's why I would like the ps output, because it sounds odd. >>> >>>> Allan will try to get a snapshot of this shortly. >>>> >>>> If this observation if correct, whats the best way to find out where its >>>> spinning? Profiling? Debug logging? Can you get profiling data from a >>>> JVM that doesnt exit? >>> Once I know where it is, I can look at the code and then we'll go from >>> there. >>> >>> >>> >> >> > From aespinosa at cs.uchicago.edu Mon Jul 13 14:12:44 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 13 Jul 2009 14:12:44 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <1247511969.21171.4.camel@localhost> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> Message-ID: <50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com> 97% is an average as can be seen in run06. swift version is r3005 and cogkit r2410. this is a vanilla build of swift. 2009/7/13 Mihael Hategan : > A while ago I committed a patch to run the service process with a lower > priority. Is that in use? > > Also, is logging reduced or is it the default? > > Is the 97% CPU usage a spike, or does it stay there on average? > > Can I take a look at the coaster logs from skenny's run on ranger? 
> > I'd also like to point out in as little offensive mode as I can, that > I'm working 100% on I2U2 and my lack of getting more than lightly > involved in this is a consequence of that. > > On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: >> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. ?From >> here process 22395 is the child of the main java process >> (bootstrap.jar) and is loading the CPU. >> >> I have coasters.log, worker-*log, swift logs, gram logs in >> ~aespinosa/workflows/activelog/run06. ?This refers to a different run. >> ?PID 15206 is the child java process of bootstrap.jar in here. >> >> top snapshot: >> top - 13:49:03 up 55 days, ?1:45, ?1 user, ?load average: 1.18, 0.80, 0.55 >> Tasks: 121 total, ? 1 running, 120 sleeping, ? 0 stopped, ? 0 zombie >> Cpu(s): ?7.5%us, ?2.8%sy, 48.7%ni, 41.0%id, ?0.0%wa, ?0.0%hi, ?0.0%si, ?0.0%st >> Mem: ? 4058916k total, ?3889864k used, ? 169052k free, ? 239688k buffers >> Swap: ?4192956k total, ? ? ? 96k used, ?4192860k free, ?2504812k cached >> >> ? PID USER ? ? ?PR ?NI ?VIRT ?RES ?SHR S %CPU %MEM ? ?TIME+ ?COMMAND >> 22395 aespinos ?25 ?10 ?525m ?91m ?13m S 97.5 ?2.3 ? 4:29.22 java >> 22217 aespinos ?15 ? 0 10736 1048 ?776 R ?0.3 ?0.0 ? 0:00.50 top >> 22243 aespinos ?16 ? 0 ?102m 5576 3536 S ?0.3 ?0.1 ? 0:00.10 globus-job-mana >> 14764 aespinos ?15 ? 0 98024 1744 ?976 S ?0.0 ?0.0 ? 0:00.06 sshd >> 14765 aespinos ?15 ? 0 65364 2796 1176 S ?0.0 ?0.1 ? 0:00.18 bash >> 22326 aespinos ?18 ? 0 ?8916 1052 ?852 S ?0.0 ?0.0 ? 0:00.00 bash >> 22328 aespinos ?19 ? 0 ?8916 1116 ?908 S ?0.0 ?0.0 ? 0:00.00 bash >> 22364 aespinos ?15 ? 0 1222m ?18m 8976 S ?0.0 ?0.5 ? 0:00.20 java >> 22444 aespinos ?16 ? 0 ?102m 5684 3528 S ?0.0 ?0.1 ? 0:00.09 globus-job-man >> >> ps snapshot: >> >> 22328 ? ? ? ? ?S ? ? ?0:00 ?\_ /bin/bash >> 22364 ? ? ? ? ?Sl ? ? 0:00 ? ? ?\_ >> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java >> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= >> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar >> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520 >> https://128.135.125.17:46519 11505253269 >> 22395 ? ? ? ? ?SNl ? ?6:29 ? ? ? ? 
?\_ >> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M >> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu >> -Djava.security.egd=file:///dev/urandom -cp >> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec946b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar:/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_service-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coa
sters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc >> >> >> >> 2009/7/13 Mihael Hategan : >> > On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: >> >> >> >> >> >> At the time we did not have a chance to gather detailed evidence, but I >> >> >> was surprised by two things: >> >> >> >> >> >> - that there were two Java processes and that one was so big. (Are most >> >> >> likely the active process was just a child thread of the main process?) >> >> > >> >> > One java process is the bootstrap process (it downloads the coaster >> >> > jars, sets up the environment and runs the coaster service). It has >> >> > always been like this. Did you happen to capture the output of ps to a >> >> > file? That would be useful, because from what you are suggesting, it >> >> > appears that the bootstrap process is eating 100% CPU. That process >> >> > should only be sleeping after the service is started. >> >> >> >> I *thought* I captured the output of "top -u sarahs'id -b -d" but I cant >> >> locate it. >> >> >> >> As best as I can recall it showed the larger memory-footprint process to >> >> be relatively idle, and the smaller footprint process (about 275MB) to >> >> be burning 100% of a CPU. >> > >> > Normally, the smaller footprint process should be the bootstrap. But >> > that's why I would like the ps output, because it sounds odd. >> > >> >> ? Allan will try to get a snapshot of this shortly. >> >> >> >> If this observation if correct, whats the best way to find out where its >> >> spinning? Profiling? Debug logging? Can you get profiling data from a >> >> JVM that doesnt exit? >> > >> > Once I know where it is, I can look at the code and then we'll go from >> > there. >> > >> > >> > >> >> >> > > > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From wilde at mcs.anl.gov Mon Jul 13 14:18:05 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Jul 2009 14:18:05 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <1247511969.21171.4.camel@localhost> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> Message-ID: <4A5B886D.1020400@mcs.anl.gov> On 7/13/09 2:06 PM, Mihael Hategan wrote: > A while ago I committed a patch to run the service process with a lower > priority. Is that in use? > > Also, is logging reduced or is it the default? > > Is the 97% CPU usage a spike, or does it stay there on average? In the test I observed Sarah running last Thu, it stayed close to 100% during the whole run - many minutes, solid near-100% CPU. During that time a tail of the coaster log showed a burst of a few messages every few seconds - not intensive enough to explain the overhead as all due to logging. Allan will need to comment on the runs he describes below. - Mike > > Can I take a look at the coaster logs from skenny's run on ranger? 
> > I'd also like to point out in as little offensive mode as I can, that > I'm working 100% on I2U2 and my lack of getting more than lightly > involved in this is a consequence of that. > > On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: >> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From >> here process 22395 is the child of the main java process >> (bootstrap.jar) and is loading the CPU. >> >> I have coasters.log, worker-*log, swift logs, gram logs in >> ~aespinosa/workflows/activelog/run06. This refers to a different run. >> PID 15206 is the child java process of bootstrap.jar in here. >> >> top snapshot: >> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55 >> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie >> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st >> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers >> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java >> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top >> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 globus-job-mana >> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd >> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash >> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash >> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash >> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java >> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 globus-job-man >> >> ps snapshot: >> >> 22328 ? S 0:00 \_ /bin/bash >> 22364 ? Sl 0:00 \_ >> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java >> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= >> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar >> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520 >> https://128.135.125.17:46519 11505253269 >> 22395 ? 
SNl 6:29 \_ >> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M >> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu >> -Djava.security.egd=file:///dev/urandom -cp >> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec94 6b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar: /home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_se 
rvice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc >> >> >> >> 2009/7/13 Mihael Hategan : >>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: >>>>>> At the time we did not have a chance to gather detailed evidence, but I >>>>>> was surprised by two things: >>>>>> >>>>>> - that there were two Java processes and that one was so big. (Are most >>>>>> likely the active process was just a child thread of the main process?) >>>>> One java process is the bootstrap process (it downloads the coaster >>>>> jars, sets up the environment and runs the coaster service). It has >>>>> always been like this. Did you happen to capture the output of ps to a >>>>> file? That would be useful, because from what you are suggesting, it >>>>> appears that the bootstrap process is eating 100% CPU. That process >>>>> should only be sleeping after the service is started. >>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I cant >>>> locate it. >>>> >>>> As best as I can recall it showed the larger memory-footprint process to >>>> be relatively idle, and the smaller footprint process (about 275MB) to >>>> be burning 100% of a CPU. >>> Normally, the smaller footprint process should be the bootstrap. But >>> that's why I would like the ps output, because it sounds odd. >>> >>>> Allan will try to get a snapshot of this shortly. >>>> >>>> If this observation if correct, whats the best way to find out where its >>>> spinning? Profiling? Debug logging? Can you get profiling data from a >>>> JVM that doesnt exit? >>> Once I know where it is, I can look at the code and then we'll go from >>> there. >>> >>> >>> >> >> > From hategan at mcs.anl.gov Mon Jul 13 14:24:25 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 14:24:25 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <4A5B86E6.2000803@mcs.anl.gov> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> <4A5B86E6.2000803@mcs.anl.gov> Message-ID: <1247513065.21484.8.camel@localhost> On Mon, 2009-07-13 at 14:11 -0500, Michael Wilde wrote: > On 7/13/09 2:06 PM, Mihael Hategan wrote: > > A while ago I committed a patch to run the service process with a lower > > priority. Is that in use? > > Looks like 22395 is running with a nice value of 10 which I think is > what you set in that patch: 22395 aespinos 25 10 Ok. Now, lower priority doesn't mean it won't use CPU. 
It means that other processes with a higher priority will get preferential treatment, and if there is CPU left and the coasters need it, it will be used. In other words, near 100% CPU usage isn't in itself a problem. While it shouldn't stay there according to my understanding of the code, if that is the only problem observed, then I think it's an overreaction. > > > > Also, is logging reduced or is it the default? > > > > Is the 97% CPU usage a spike, or does it stay there on average? > > > > Can I take a look at the coaster logs from skenny's run on ranger? > > > > I'd also like to point out in as little offensive mode as I can, that > > I'm working 100% on I2U2 and my lack of getting more than lightly > > involved in this is a consequence of that. > > Right, understood. Any pointers you can give are welcome, and Allan and > I are expecting to do the legwork. We'll at least try to find out where > the overhead is coming from. I find it somewhat odd that there was a process with 1GB of virtual memory use. Are you sure that wasn't a WSRF container from somebody else? Can we switch to exclusive evidence mode here (i.e. nothing is considered unless there is clear proof of it, like a screen dump or log output, or copy an paste of session from a terminal)? From hategan at mcs.anl.gov Mon Jul 13 14:25:44 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 14:25:44 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> <50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com> Message-ID: <1247513144.21484.10.camel@localhost> On Mon, 2009-07-13 at 14:12 -0500, Allan Espinosa wrote: > 97% is an average as can be seen in run06. swift version is r3005 and > cogkit r2410. this is a vanilla build of swift. Can you run with reduced logging? We established before that logging appears to be a problem and before we eliminate that it's wasteful to continue guessing. > > 2009/7/13 Mihael Hategan : > > A while ago I committed a patch to run the service process with a lower > > priority. Is that in use? > > > > Also, is logging reduced or is it the default? > > > > Is the 97% CPU usage a spike, or does it stay there on average? > > > > Can I take a look at the coaster logs from skenny's run on ranger? > > > > I'd also like to point out in as little offensive mode as I can, that > > I'm working 100% on I2U2 and my lack of getting more than lightly > > involved in this is a consequence of that. > > > > On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: > >> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From > >> here process 22395 is the child of the main java process > >> (bootstrap.jar) and is loading the CPU. > >> > >> I have coasters.log, worker-*log, swift logs, gram logs in > >> ~aespinosa/workflows/activelog/run06. This refers to a different run. > >> PID 15206 is the child java process of bootstrap.jar in here. 
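(On the "reduced logging" suggestion above: the coaster service bundles log4j 1.2, as the classpath dumps in this thread show. In a real Swift/coaster deployment the usual knob is the log4j.properties file shipped with the installation; the snippet below is only a hedged sketch of the programmatic equivalent, and the "org.globus.cog" logger name is an assumption rather than a documented setting.)

```java
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

// Hedged sketch only: raise log4j thresholds to cut log volume while
// testing the logging-overhead theory. In practice this is normally done
// in log4j.properties; the logger name "org.globus.cog" is an assumption.
public class QuietCoasterLogging {
    public static void main(String[] args) {
        // Emit only warnings and errors from the root logger.
        Logger.getRootLogger().setLevel(Level.WARN);
        // Quiet the CoG/coaster packages specifically (package name assumed).
        Logger.getLogger("org.globus.cog").setLevel(Level.WARN);
    }
}
```

The properties-file equivalent is simply setting the root logger threshold to WARN instead of INFO/DEBUG before re-running the test.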
> >> > >> top snapshot: > >> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55 > >> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie > >> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > >> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers > >> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached > >> > >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > >> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java > >> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top > >> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 globus-job-mana > >> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd > >> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash > >> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash > >> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash > >> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java > >> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 globus-job-man > >> > >> ps snapshot: > >> > >> 22328 ? S 0:00 \_ /bin/bash > >> 22364 ? Sl 0:00 \_ > >> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java > >> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= > >> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up > >> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar > >> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520 > >> https://128.135.125.17:46519 11505253269 > >> 22395 ? SNl 6:29 \_ > >> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M > >> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up > >> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu > >> -Djava.security.egd=file:///dev/urandom -cp > >> 
/home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec946b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar:/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_service-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/h
ome/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc > >> > >> > >> > >> 2009/7/13 Mihael Hategan : > >> > On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: > >> >> >> > >> >> >> At the time we did not have a chance to gather detailed evidence, but I > >> >> >> was surprised by two things: > >> >> >> > >> >> >> - that there were two Java processes and that one was so big. (Are most > >> >> >> likely the active process was just a child thread of the main process?) > >> >> > > >> >> > One java process is the bootstrap process (it downloads the coaster > >> >> > jars, sets up the environment and runs the coaster service). It has > >> >> > always been like this. Did you happen to capture the output of ps to a > >> >> > file? That would be useful, because from what you are suggesting, it > >> >> > appears that the bootstrap process is eating 100% CPU. That process > >> >> > should only be sleeping after the service is started. > >> >> > >> >> I *thought* I captured the output of "top -u sarahs'id -b -d" but I cant > >> >> locate it. > >> >> > >> >> As best as I can recall it showed the larger memory-footprint process to > >> >> be relatively idle, and the smaller footprint process (about 275MB) to > >> >> be burning 100% of a CPU. > >> > > >> > Normally, the smaller footprint process should be the bootstrap. But > >> > that's why I would like the ps output, because it sounds odd. > >> > > >> >> Allan will try to get a snapshot of this shortly. > >> >> > >> >> If this observation if correct, whats the best way to find out where its > >> >> spinning? Profiling? Debug logging? Can you get profiling data from a > >> >> JVM that doesnt exit? > >> > > >> > Once I know where it is, I can look at the code and then we'll go from > >> > there. > >> > > >> > > >> > > >> > >> > >> > > > > > > > > > From hategan at mcs.anl.gov Mon Jul 13 14:30:40 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 14:30:40 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <4A5B886D.1020400@mcs.anl.gov> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> <4A5B886D.1020400@mcs.anl.gov> Message-ID: <1247513440.21484.16.camel@localhost> On Mon, 2009-07-13 at 14:18 -0500, Michael Wilde wrote: > On 7/13/09 2:06 PM, Mihael Hategan wrote: > > A while ago I committed a patch to run the service process with a lower > > priority. Is that in use? > > > > Also, is logging reduced or is it the default? > > > > Is the 97% CPU usage a spike, or does it stay there on average? > > In the test I observed Sarah running last Thu, it stayed close to 100% > during the whole run - many minutes, solid near-100% CPU. During that > time a tail of the coaster log showed a burst of a few messages every > few seconds - not intensive enough to explain the overhead as all due to > logging. Ok, that does look like a problem. I need to see the log from that. 
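(For the jstack request that follows: running jstack against the PID of the spinning service JVM and saving the output is usually sufficient. If jstack happens not to be installed next to the JDK on the gatekeeper node, a rough in-process fallback using Java 5's Thread.getAllStackTraces() is sketched below; it only sees the JVM it runs in, so wiring it into the service itself is purely hypothetical and not something the coaster code currently does.)

```java
import java.util.Map;

// Hedged sketch: dump every thread's name, state and stack from inside a
// JVM, as a fallback when an external "jstack <pid>" is unavailable.
// This reports only the JVM it executes in.
public class ThreadDump {
    public static void dumpAll() {
        Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet()) {
            Thread t = entry.getKey();
            System.err.println("\"" + t.getName() + "\" state=" + t.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.err.println("    at " + frame);
            }
        }
    }

    public static void main(String[] args) {
        dumpAll();
    }
}
```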
In addition, when you observe this SPECIFIC behavior (solid near 100% CPU, burst of a few messages every few seconds and not much else in the logs), please do a jstack on the process in question and send the output of that. From hategan at mcs.anl.gov Mon Jul 13 14:34:04 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 14:34:04 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <1247513440.21484.16.camel@localhost> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> <4A5B886D.1020400@mcs.anl.gov> <1247513440.21484.16.camel@localhost> Message-ID: <1247513644.21484.18.camel@localhost> On Mon, 2009-07-13 at 14:30 -0500, Mihael Hategan wrote: > On Mon, 2009-07-13 at 14:18 -0500, Michael Wilde wrote: > > On 7/13/09 2:06 PM, Mihael Hategan wrote: > > > A while ago I committed a patch to run the service process with a lower > > > priority. Is that in use? > > > > > > Also, is logging reduced or is it the default? > > > > > > Is the 97% CPU usage a spike, or does it stay there on average? > > > > In the test I observed Sarah running last Thu, it stayed close to 100% > > during the whole run - many minutes, solid near-100% CPU. During that > > time a tail of the coaster log showed a burst of a few messages every > > few seconds - not intensive enough to explain the overhead as all due to > > logging. > > Ok, that does look like a problem. I need to see the log from that. However, I want to stress out that it may NOT be the same problem in all cases of high CPU usage. So reduced logging should still be used before trying to reproduce this specific problem. > > In addition, when you observe this SPECIFIC behavior (solid near 100% > CPU, burst of a few messages every few seconds and not much else in the > logs), please do a jstack on the process in question and send the > output of that. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Mon Jul 13 17:04:35 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 13 Jul 2009 17:04:35 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <4A5BA007.2050101@mcs.anl.gov> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> <50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com> <4A5BA007.2050101@mcs.anl.gov> Message-ID: <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com> hi, here is a patch which solves the cpu usage on the bootstrap coaster service: http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch suggested svn log entry: Added locks via wait() and notify() to prevent busy waiting/ active polling in the block task queue. Test 2000 touch job using 066-many.swift via local:local : before: http://www.ci.uchicago.edu/~aespinosa/swift/run06 after: http://www.ci.uchicago.edu/~aespinosa/swift/run07 CPU usage drops from 100% to 0% with a few 25-40 % spikes! -Allan 2009/7/13 Michael Wilde : > Hi Allan, > > I think the methods you want for synchronization are part of class Object. 
> > They are documented in the chapter Threads and Locks of The Java Language > Specification: > > http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8 > > queue.wait() should be called if the queue is empty. > > queue.notify() or .notifyall() should be called when something is added to > the queue. I think notify() should work. > > .wait will I think take a timer, but suspect you dont need that. > > Both should be called within the synchronized(queue) constructs that are > already in the code. > > Should be fun to fix this! > > - Mike > > > > > > On 7/13/09 2:12 PM, Allan Espinosa wrote: >> >> 97% is an average as can be seen in run06. ?swift version is r3005 and >> cogkit r2410. ?this is a vanilla build of swift. >> >> 2009/7/13 Mihael Hategan : >>> >>> A while ago I committed a patch to run the service process with a lower >>> priority. Is that in use? >>> >>> Also, is logging reduced or is it the default? >>> >>> Is the 97% CPU usage a spike, or does it stay there on average? >>> >>> Can I take a look at the coaster logs from skenny's run on ranger? >>> >>> I'd also like to point out in as little offensive mode as I can, that >>> I'm working 100% on I2U2 and my lack of getting more than lightly >>> involved in this is a consequence of that. >>> >>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: >>>> >>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. ?From >>>> here process 22395 is the child of the main java process >>>> (bootstrap.jar) and is loading the CPU. >>>> >>>> I have coasters.log, worker-*log, swift logs, gram logs in >>>> ~aespinosa/workflows/activelog/run06. ?This refers to a different run. >>>> ?PID 15206 is the child java process of bootstrap.jar in here. >>>> >>>> top snapshot: >>>> top - 13:49:03 up 55 days, ?1:45, ?1 user, ?load average: 1.18, 0.80, >>>> 0.55 >>>> Tasks: 121 total, ? 1 running, 120 sleeping, ? 0 stopped, ? 0 zombie >>>> Cpu(s): ?7.5%us, ?2.8%sy, 48.7%ni, 41.0%id, ?0.0%wa, ?0.0%hi, ?0.0%si, >>>> ?0.0%st >>>> Mem: ? 4058916k total, ?3889864k used, ? 169052k free, ? 239688k buffers >>>> Swap: ?4192956k total, ? ? ? 96k used, ?4192860k free, ?2504812k cached >>>> >>>> ?PID USER ? ? ?PR ?NI ?VIRT ?RES ?SHR S %CPU %MEM ? ?TIME+ ?COMMAND >>>> 22395 aespinos ?25 ?10 ?525m ?91m ?13m S 97.5 ?2.3 ? 4:29.22 java >>>> 22217 aespinos ?15 ? 0 10736 1048 ?776 R ?0.3 ?0.0 ? 0:00.50 top >>>> 22243 aespinos ?16 ? 0 ?102m 5576 3536 S ?0.3 ?0.1 ? 0:00.10 >>>> globus-job-mana >>>> 14764 aespinos ?15 ? 0 98024 1744 ?976 S ?0.0 ?0.0 ? 0:00.06 sshd >>>> 14765 aespinos ?15 ? 0 65364 2796 1176 S ?0.0 ?0.1 ? 0:00.18 bash >>>> 22326 aespinos ?18 ? 0 ?8916 1052 ?852 S ?0.0 ?0.0 ? 0:00.00 bash >>>> 22328 aespinos ?19 ? 0 ?8916 1116 ?908 S ?0.0 ?0.0 ? 0:00.00 bash >>>> 22364 aespinos ?15 ? 0 1222m ?18m 8976 S ?0.0 ?0.5 ? 0:00.20 java >>>> 22444 aespinos ?16 ? 0 ?102m 5684 3528 S ?0.0 ?0.1 ? 0:00.09 >>>> globus-job-man >>>> >>>> ps snapshot: >>>> >>>> 22328 ? ? ? ? ?S ? ? ?0:00 ?\_ /bin/bash >>>> 22364 ? ? ? ? ?Sl ? ? 0:00 ? ? ?\_ >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java >>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= >>>> >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >>>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar >>>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520 >>>> https://128.135.125.17:46519 11505253269 >>>> 22395 ? ? ? ? ?SNl ? ?6:29 ? ? ? ? 
?\_ >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M >>>> >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu >>>> -Djava.security.egd=file:///dev/urandom -cp >>>> >>>> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec9 > > 46b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar > :/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_s > 
ervice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc >>>> >>>> >>>> >>>> 2009/7/13 Mihael Hategan : >>>>> >>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: >>>>>>>> >>>>>>>> At the time we did not have a chance to gather detailed evidence, >>>>>>>> but I >>>>>>>> was surprised by two things: >>>>>>>> >>>>>>>> - that there were two Java processes and that one was so big. (Are >>>>>>>> most >>>>>>>> likely the active process was just a child thread of the main >>>>>>>> process?) >>>>>>> >>>>>>> One java process is the bootstrap process (it downloads the coaster >>>>>>> jars, sets up the environment and runs the coaster service). It has >>>>>>> always been like this. Did you happen to capture the output of ps to >>>>>>> a >>>>>>> file? That would be useful, because from what you are suggesting, it >>>>>>> appears that the bootstrap process is eating 100% CPU. That process >>>>>>> should only be sleeping after the service is started. >>>>>> >>>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I >>>>>> cant >>>>>> locate it. >>>>>> >>>>>> As best as I can recall it showed the larger memory-footprint process >>>>>> to >>>>>> be relatively idle, and the smaller footprint process (about 275MB) to >>>>>> be burning 100% of a CPU. >>>>> >>>>> Normally, the smaller footprint process should be the bootstrap. But >>>>> that's why I would like the ps output, because it sounds odd. >>>>> >>>>>> ?Allan will try to get a snapshot of this shortly. >>>>>> >>>>>> If this observation if correct, whats the best way to find out where >>>>>> its >>>>>> spinning? Profiling? Debug logging? Can you get profiling data from a >>>>>> JVM that doesnt exit? >>>>> >>>>> Once I know where it is, I can look at the code and then we'll go from >>>>> there. >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> >> >> > > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From wilde at mcs.anl.gov Mon Jul 13 17:17:26 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Jul 2009 17:17:26 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> <50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com> <4A5BA007.2050101@mcs.anl.gov> <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com> Message-ID: <4A5BB276.1070902@mcs.anl.gov> Nice! Now lets beat it up and see how well it works. 
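(The pattern behind Allan's fix, for readers following along: the consumer blocks in Object.wait() while the queue is empty instead of polling it in a loop, and the producer calls notify()/notifyAll() after adding work, all inside synchronized blocks on the same queue object, exactly as Mike outlines above. The sketch below is a minimal, self-contained illustration of that pattern; the class and method names are hypothetical, and it is not the actual provider-coaster patch, which is the one linked above and later committed as cog r2429.)

```java
import java.util.LinkedList;

// Minimal sketch of the wait()/notify() pattern described above.
// Names are hypothetical; this illustrates the technique, not the
// actual provider-coaster block task queue.
public class BlockingTaskQueue<T> {
    private final LinkedList<T> queue = new LinkedList<T>();

    // Producer: add a task and wake any consumer blocked in take().
    public void put(T task) {
        synchronized (queue) {
            queue.addLast(task);
            queue.notifyAll();   // notify() also suffices with a single consumer
        }
    }

    // Consumer: sleep inside wait() until work arrives; no busy polling.
    public T take() throws InterruptedException {
        synchronized (queue) {
            while (queue.isEmpty()) {
                queue.wait();    // releases the monitor while sleeping
            }
            return queue.removeFirst();
        }
    }
}
```

The while loop (rather than an if) guards against spurious wake-ups, and a bounded wait(timeoutMillis) can be used instead if the consumer also needs to do periodic housekeeping, which is the timer option Mike mentions above.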
Sarah: Allan did not encounter the error messages you mentioned to me. I suggest you do this: - post to the devel list the messages you got - test this patch to see if it clears up the problem Mike On 7/13/09 5:04 PM, Allan Espinosa wrote: > hi, > > here is a patch which solves the cpu usage on the bootstrap coaster > service: http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch > > suggested svn log entry: > Added locks via wait() and notify() to prevent busy waiting/ > active polling in the block task queue. > > > Test 2000 touch job using 066-many.swift via local:local : > before: http://www.ci.uchicago.edu/~aespinosa/swift/run06 > after: http://www.ci.uchicago.edu/~aespinosa/swift/run07 > > CPU usage drops from 100% to 0% with a few 25-40 % spikes! > > -Allan > > > 2009/7/13 Michael Wilde : >> Hi Allan, >> >> I think the methods you want for synchronization are part of class Object. >> >> They are documented in the chapter Threads and Locks of The Java Language >> Specification: >> >> http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8 >> >> queue.wait() should be called if the queue is empty. >> >> queue.notify() or .notifyall() should be called when something is added to >> the queue. I think notify() should work. >> >> .wait will I think take a timer, but suspect you dont need that. >> >> Both should be called within the synchronized(queue) constructs that are >> already in the code. >> >> Should be fun to fix this! >> >> - Mike >> >> >> >> >> >> On 7/13/09 2:12 PM, Allan Espinosa wrote: >>> 97% is an average as can be seen in run06. swift version is r3005 and >>> cogkit r2410. this is a vanilla build of swift. >>> >>> 2009/7/13 Mihael Hategan : >>>> A while ago I committed a patch to run the service process with a lower >>>> priority. Is that in use? >>>> >>>> Also, is logging reduced or is it the default? >>>> >>>> Is the 97% CPU usage a spike, or does it stay there on average? >>>> >>>> Can I take a look at the coaster logs from skenny's run on ranger? >>>> >>>> I'd also like to point out in as little offensive mode as I can, that >>>> I'm working 100% on I2U2 and my lack of getting more than lightly >>>> involved in this is a consequence of that. >>>> >>>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: >>>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From >>>>> here process 22395 is the child of the main java process >>>>> (bootstrap.jar) and is loading the CPU. >>>>> >>>>> I have coasters.log, worker-*log, swift logs, gram logs in >>>>> ~aespinosa/workflows/activelog/run06. This refers to a different run. >>>>> PID 15206 is the child java process of bootstrap.jar in here. 
>>>>> >>>>> top snapshot: >>>>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, >>>>> 0.55 >>>>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie >>>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, >>>>> 0.0%st >>>>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers >>>>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached >>>>> >>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >>>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java >>>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top >>>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 >>>>> globus-job-mana >>>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd >>>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash >>>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash >>>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash >>>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java >>>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 >>>>> globus-job-man >>>>> >>>>> ps snapshot: >>>>> >>>>> 22328 ? S 0:00 \_ /bin/bash >>>>> 22364 ? Sl 0:00 \_ >>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java >>>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= >>>>> >>>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >>>>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar >>>>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520 >>>>> https://128.135.125.17:46519 11505253269 >>>>> 22395 ? SNl 6:29 \_ >>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M >>>>> >>>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >>>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu >>>>> -Djava.security.egd=file:///dev/urandom -cp >>>>> >>>>> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcde c9 >> 
46b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c. jar >> :/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvou s_s >> ervice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc >>>>> >>>>> >>>>> 2009/7/13 Mihael Hategan : >>>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: >>>>>>>>> At the time we did not have a chance to gather detailed evidence, >>>>>>>>> but I >>>>>>>>> was surprised by two things: >>>>>>>>> >>>>>>>>> - that there were two Java processes and that one was so big. (Are >>>>>>>>> most >>>>>>>>> likely the active process was just a child thread of the main >>>>>>>>> process?) >>>>>>>> One java process is the bootstrap process (it downloads the coaster >>>>>>>> jars, sets up the environment and runs the coaster service). It has >>>>>>>> always been like this. 
Did you happen to capture the output of ps to >>>>>>>> a >>>>>>>> file? That would be useful, because from what you are suggesting, it >>>>>>>> appears that the bootstrap process is eating 100% CPU. That process >>>>>>>> should only be sleeping after the service is started. >>>>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I >>>>>>> cant >>>>>>> locate it. >>>>>>> >>>>>>> As best as I can recall it showed the larger memory-footprint process >>>>>>> to >>>>>>> be relatively idle, and the smaller footprint process (about 275MB) to >>>>>>> be burning 100% of a CPU. >>>>>> Normally, the smaller footprint process should be the bootstrap. But >>>>>> that's why I would like the ps output, because it sounds odd. >>>>>> >>>>>>> Allan will try to get a snapshot of this shortly. >>>>>>> >>>>>>> If this observation if correct, whats the best way to find out where >>>>>>> its >>>>>>> spinning? Profiling? Debug logging? Can you get profiling data from a >>>>>>> JVM that doesnt exit? >>>>>> Once I know where it is, I can look at the code and then we'll go from >>>>>> there. >>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >>> >> > > > From hategan at mcs.anl.gov Mon Jul 13 17:34:07 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 17:34:07 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> <50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com> <4A5BA007.2050101@mcs.anl.gov> <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com> Message-ID: <1247524447.25358.0.camel@localhost> Holly matrimony! I will go sit in a corner now. Very nice work Allan. Mihael On Mon, 2009-07-13 at 17:04 -0500, Allan Espinosa wrote: > hi, > > here is a patch which solves the cpu usage on the bootstrap coaster > service: http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch > > suggested svn log entry: > Added locks via wait() and notify() to prevent busy waiting/ > active polling in the block task queue. > > > Test 2000 touch job using 066-many.swift via local:local : > before: http://www.ci.uchicago.edu/~aespinosa/swift/run06 > after: http://www.ci.uchicago.edu/~aespinosa/swift/run07 > > CPU usage drops from 100% to 0% with a few 25-40 % spikes! > > -Allan > > > 2009/7/13 Michael Wilde : > > Hi Allan, > > > > I think the methods you want for synchronization are part of class Object. > > > > They are documented in the chapter Threads and Locks of The Java Language > > Specification: > > > > http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8 > > > > queue.wait() should be called if the queue is empty. > > > > queue.notify() or .notifyall() should be called when something is added to > > the queue. I think notify() should work. > > > > .wait will I think take a timer, but suspect you dont need that. > > > > Both should be called within the synchronized(queue) constructs that are > > already in the code. > > > > Should be fun to fix this! > > > > - Mike > > > > > > > > > > > > On 7/13/09 2:12 PM, Allan Espinosa wrote: > >> > >> 97% is an average as can be seen in run06. swift version is r3005 and > >> cogkit r2410. this is a vanilla build of swift. 
> >> > >> 2009/7/13 Mihael Hategan : > >>> > >>> A while ago I committed a patch to run the service process with a lower > >>> priority. Is that in use? > >>> > >>> Also, is logging reduced or is it the default? > >>> > >>> Is the 97% CPU usage a spike, or does it stay there on average? > >>> > >>> Can I take a look at the coaster logs from skenny's run on ranger? > >>> > >>> I'd also like to point out in as little offensive mode as I can, that > >>> I'm working 100% on I2U2 and my lack of getting more than lightly > >>> involved in this is a consequence of that. > >>> > >>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: > >>>> > >>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From > >>>> here process 22395 is the child of the main java process > >>>> (bootstrap.jar) and is loading the CPU. > >>>> > >>>> I have coasters.log, worker-*log, swift logs, gram logs in > >>>> ~aespinosa/workflows/activelog/run06. This refers to a different run. > >>>> PID 15206 is the child java process of bootstrap.jar in here. > >>>> > >>>> top snapshot: > >>>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, > >>>> 0.55 > >>>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie > >>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, > >>>> 0.0%st > >>>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers > >>>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached > >>>> > >>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > >>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java > >>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top > >>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 > >>>> globus-job-mana > >>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd > >>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash > >>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash > >>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash > >>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java > >>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 > >>>> globus-job-man > >>>> > >>>> ps snapshot: > >>>> > >>>> 22328 ? S 0:00 \_ /bin/bash > >>>> 22364 ? Sl 0:00 \_ > >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java > >>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= > >>>> > >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up > >>>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar > >>>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520 > >>>> https://128.135.125.17:46519 11505253269 > >>>> 22395 ? 
SNl 6:29 \_ > >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M > >>>> > >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up > >>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu > >>>> -Djava.security.egd=file:///dev/urandom -cp > >>>> > >>>> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcd ec9 > > > > 46b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c .jar > > :/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvo us_s > > 
ervice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc > >>>> > >>>> > >>>> > >>>> 2009/7/13 Mihael Hategan : > >>>>> > >>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: > >>>>>>>> > >>>>>>>> At the time we did not have a chance to gather detailed evidence, > >>>>>>>> but I > >>>>>>>> was surprised by two things: > >>>>>>>> > >>>>>>>> - that there were two Java processes and that one was so big. (Are > >>>>>>>> most > >>>>>>>> likely the active process was just a child thread of the main > >>>>>>>> process?) > >>>>>>> > >>>>>>> One java process is the bootstrap process (it downloads the coaster > >>>>>>> jars, sets up the environment and runs the coaster service). It has > >>>>>>> always been like this. Did you happen to capture the output of ps to > >>>>>>> a > >>>>>>> file? That would be useful, because from what you are suggesting, it > >>>>>>> appears that the bootstrap process is eating 100% CPU. That process > >>>>>>> should only be sleeping after the service is started. > >>>>>> > >>>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I > >>>>>> cant > >>>>>> locate it. > >>>>>> > >>>>>> As best as I can recall it showed the larger memory-footprint process > >>>>>> to > >>>>>> be relatively idle, and the smaller footprint process (about 275MB) to > >>>>>> be burning 100% of a CPU. > >>>>> > >>>>> Normally, the smaller footprint process should be the bootstrap. But > >>>>> that's why I would like the ps output, because it sounds odd. > >>>>> > >>>>>> Allan will try to get a snapshot of this shortly. > >>>>>> > >>>>>> If this observation if correct, whats the best way to find out where > >>>>>> its > >>>>>> spinning? Profiling? Debug logging? Can you get profiling data from a > >>>>>> JVM that doesnt exit? > >>>>> > >>>>> Once I know where it is, I can look at the code and then we'll go from > >>>>> there. 
> >>>>> > >>>>> > >>>>> > >>>> > >>>> > >>> > >>> > >> > >> > >> > > > > > > > From hategan at mcs.anl.gov Mon Jul 13 17:41:42 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 17:41:42 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> <50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com> <4A5BA007.2050101@mcs.anl.gov> <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com> Message-ID: <1247524902.25358.3.camel@localhost> A slightly modified version of this is in cog r2429. Thanks again, Mihael On Mon, 2009-07-13 at 17:04 -0500, Allan Espinosa wrote: > hi, > > here is a patch which solves the cpu usage on the bootstrap coaster > service: http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch > > suggested svn log entry: > Added locks via wait() and notify() to prevent busy waiting/ > active polling in the block task queue. > > > Test 2000 touch job using 066-many.swift via local:local : > before: http://www.ci.uchicago.edu/~aespinosa/swift/run06 > after: http://www.ci.uchicago.edu/~aespinosa/swift/run07 > > CPU usage drops from 100% to 0% with a few 25-40 % spikes! > > -Allan > > > 2009/7/13 Michael Wilde : > > Hi Allan, > > > > I think the methods you want for synchronization are part of class Object. > > > > They are documented in the chapter Threads and Locks of The Java Language > > Specification: > > > > http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8 > > > > queue.wait() should be called if the queue is empty. > > > > queue.notify() or .notifyall() should be called when something is added to > > the queue. I think notify() should work. > > > > .wait will I think take a timer, but suspect you dont need that. > > > > Both should be called within the synchronized(queue) constructs that are > > already in the code. > > > > Should be fun to fix this! > > > > - Mike > > > > > > > > > > > > On 7/13/09 2:12 PM, Allan Espinosa wrote: > >> > >> 97% is an average as can be seen in run06. swift version is r3005 and > >> cogkit r2410. this is a vanilla build of swift. > >> > >> 2009/7/13 Mihael Hategan : > >>> > >>> A while ago I committed a patch to run the service process with a lower > >>> priority. Is that in use? > >>> > >>> Also, is logging reduced or is it the default? > >>> > >>> Is the 97% CPU usage a spike, or does it stay there on average? > >>> > >>> Can I take a look at the coaster logs from skenny's run on ranger? > >>> > >>> I'd also like to point out in as little offensive mode as I can, that > >>> I'm working 100% on I2U2 and my lack of getting more than lightly > >>> involved in this is a consequence of that. > >>> > >>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: > >>>> > >>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From > >>>> here process 22395 is the child of the main java process > >>>> (bootstrap.jar) and is loading the CPU. > >>>> > >>>> I have coasters.log, worker-*log, swift logs, gram logs in > >>>> ~aespinosa/workflows/activelog/run06. This refers to a different run. > >>>> PID 15206 is the child java process of bootstrap.jar in here. 
> >>>> > >>>> top snapshot: > >>>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, > >>>> 0.55 > >>>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie > >>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, > >>>> 0.0%st > >>>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers > >>>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached > >>>> > >>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > >>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java > >>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top > >>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 > >>>> globus-job-mana > >>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd > >>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash > >>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash > >>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash > >>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java > >>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 > >>>> globus-job-man > >>>> > >>>> ps snapshot: > >>>> > >>>> 22328 ? S 0:00 \_ /bin/bash > >>>> 22364 ? Sl 0:00 \_ > >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java > >>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= > >>>> > >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up > >>>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar > >>>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520 > >>>> https://128.135.125.17:46519 11505253269 > >>>> 22395 ? SNl 6:29 \_ > >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M > >>>> > >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up > >>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu > >>>> -Djava.security.egd=file:///dev/urandom -cp > >>>> > >>>> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcd ec9 > > > > 
46b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c .jar > > :/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvo us_s > > ervice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc > >>>> > >>>> > >>>> > >>>> 2009/7/13 Mihael Hategan : > >>>>> > >>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: > >>>>>>>> > >>>>>>>> At the time we did not have a chance to gather detailed evidence, > >>>>>>>> but I > >>>>>>>> was surprised by two things: > >>>>>>>> > >>>>>>>> - that there were two Java processes and that one was so big. (Are > >>>>>>>> most > >>>>>>>> likely the active process was just a child thread of the main > >>>>>>>> process?) > >>>>>>> > >>>>>>> One java process is the bootstrap process (it downloads the coaster > >>>>>>> jars, sets up the environment and runs the coaster service). 
It has > >>>>>>> always been like this. Did you happen to capture the output of ps to > >>>>>>> a > >>>>>>> file? That would be useful, because from what you are suggesting, it > >>>>>>> appears that the bootstrap process is eating 100% CPU. That process > >>>>>>> should only be sleeping after the service is started. > >>>>>> > >>>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I > >>>>>> cant > >>>>>> locate it. > >>>>>> > >>>>>> As best as I can recall it showed the larger memory-footprint process > >>>>>> to > >>>>>> be relatively idle, and the smaller footprint process (about 275MB) to > >>>>>> be burning 100% of a CPU. > >>>>> > >>>>> Normally, the smaller footprint process should be the bootstrap. But > >>>>> that's why I would like the ps output, because it sounds odd. > >>>>> > >>>>>> Allan will try to get a snapshot of this shortly. > >>>>>> > >>>>>> If this observation if correct, whats the best way to find out where > >>>>>> its > >>>>>> spinning? Profiling? Debug logging? Can you get profiling data from a > >>>>>> JVM that doesnt exit? > >>>>> > >>>>> Once I know where it is, I can look at the code and then we'll go from > >>>>> there. > >>>>> > >>>>> > >>>>> > >>>> > >>>> > >>> > >>> > >> > >> > >> > > > > > > > From skenny at uchicago.edu Mon Jul 13 17:41:50 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Mon, 13 Jul 2009 17:41:50 -0500 (CDT) Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <4A5BB276.1070902@mcs.anl.gov> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> <50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com> <4A5BA007.2050101@mcs.anl.gov> <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com> <4A5BB276.1070902@mcs.anl.gov> Message-ID: <20090713174150.CAD74976@m4500-02.uchicago.edu> cool, i'll give this a shot now too. it's possible the other err i mentioned to you mike, was actually related to the stdout redirection. i wanted to test more, but trying not to wreak havoc on the headnode :P anyway, if this works, i can do more testing and will post if i'm still getting the error. ~sk ---- Original message ---- >Date: Mon, 13 Jul 2009 17:17:26 -0500 >From: Michael Wilde >Subject: Re: [Swift-devel] Coaster CPU-time consumption issue >To: Allan Espinosa , Sarah Kenny >Cc: swift-devel > >Nice! Now lets beat it up and see how well it works. > >Sarah: Allan did not encounter the error messages you mentioned to me. > >I suggest you do this: > >- post to the devel list the messages you got > >- test this patch to see if it clears up the problem > >Mike > > >On 7/13/09 5:04 PM, Allan Espinosa wrote: >> hi, >> >> here is a patch which solves the cpu usage on the bootstrap coaster >> service: http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch >> >> suggested svn log entry: >> Added locks via wait() and notify() to prevent busy waiting/ >> active polling in the block task queue. >> >> >> Test 2000 touch job using 066-many.swift via local:local : >> before: http://www.ci.uchicago.edu/~aespinosa/swift/run06 >> after: http://www.ci.uchicago.edu/~aespinosa/swift/run07 >> >> CPU usage drops from 100% to 0% with a few 25-40 % spikes! >> >> -Allan >> >> >> 2009/7/13 Michael Wilde : >>> Hi Allan, >>> >>> I think the methods you want for synchronization are part of class Object. 
>>> >>> They are documented in the chapter Threads and Locks of The Java Language >>> Specification: >>> >>> http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8 >>> >>> queue.wait() should be called if the queue is empty. >>> >>> queue.notify() or .notifyall() should be called when something is added to >>> the queue. I think notify() should work. >>> >>> .wait will I think take a timer, but suspect you dont need that. >>> >>> Both should be called within the synchronized(queue) constructs that are >>> already in the code. >>> >>> Should be fun to fix this! >>> >>> - Mike >>> >>> >>> >>> >>> >>> On 7/13/09 2:12 PM, Allan Espinosa wrote: >>>> 97% is an average as can be seen in run06. swift version is r3005 and >>>> cogkit r2410. this is a vanilla build of swift. >>>> >>>> 2009/7/13 Mihael Hategan : >>>>> A while ago I committed a patch to run the service process with a lower >>>>> priority. Is that in use? >>>>> >>>>> Also, is logging reduced or is it the default? >>>>> >>>>> Is the 97% CPU usage a spike, or does it stay there on average? >>>>> >>>>> Can I take a look at the coaster logs from skenny's run on ranger? >>>>> >>>>> I'd also like to point out in as little offensive mode as I can, that >>>>> I'm working 100% on I2U2 and my lack of getting more than lightly >>>>> involved in this is a consequence of that. >>>>> >>>>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: >>>>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From >>>>>> here process 22395 is the child of the main java process >>>>>> (bootstrap.jar) and is loading the CPU. >>>>>> >>>>>> I have coasters.log, worker-*log, swift logs, gram logs in >>>>>> ~aespinosa/workflows/activelog/run06. This refers to a different run. >>>>>> PID 15206 is the child java process of bootstrap.jar in here. >>>>>> >>>>>> top snapshot: >>>>>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, >>>>>> 0.55 >>>>>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie >>>>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, >>>>>> 0.0%st >>>>>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers >>>>>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached >>>>>> >>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >>>>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java >>>>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top >>>>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 >>>>>> globus-job-mana >>>>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd >>>>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash >>>>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash >>>>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash >>>>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java >>>>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 >>>>>> globus-job-man >>>>>> >>>>>> ps snapshot: >>>>>> >>>>>> 22328 ? S 0:00 \_ /bin/bash >>>>>> 22364 ? Sl 0:00 \_ >>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java >>>>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= >>>>>> >>>>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >>>>>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar >>>>>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520 >>>>>> https://128.135.125.17:46519 11505253269 >>>>>> 22395 ? 
SNl 6:29 \_ >>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M >>>>>> >>>>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >>>>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu >>>>>> -Djava.security.egd=file:///dev/urandom -cp >>>>>> >>>>>> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcde >c9 >>> 46b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c. 
>jar >>> :/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvou >s_s >>> ervice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc >>>>>> >>>>>> >>>>>> 2009/7/13 Mihael Hategan : >>>>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: >>>>>>>>>> At the time we did not have a chance to gather detailed evidence, >>>>>>>>>> but I >>>>>>>>>> was surprised by two things: >>>>>>>>>> >>>>>>>>>> - that there were two Java processes and that one was so big. (Are >>>>>>>>>> most >>>>>>>>>> likely the active process was just a child thread of the main >>>>>>>>>> process?) >>>>>>>>> One java process is the bootstrap process (it downloads the coaster >>>>>>>>> jars, sets up the environment and runs the coaster service). It has >>>>>>>>> always been like this. Did you happen to capture the output of ps to >>>>>>>>> a >>>>>>>>> file? That would be useful, because from what you are suggesting, it >>>>>>>>> appears that the bootstrap process is eating 100% CPU. That process >>>>>>>>> should only be sleeping after the service is started. >>>>>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I >>>>>>>> cant >>>>>>>> locate it. >>>>>>>> >>>>>>>> As best as I can recall it showed the larger memory-footprint process >>>>>>>> to >>>>>>>> be relatively idle, and the smaller footprint process (about 275MB) to >>>>>>>> be burning 100% of a CPU. >>>>>>> Normally, the smaller footprint process should be the bootstrap. But >>>>>>> that's why I would like the ps output, because it sounds odd. >>>>>>> >>>>>>>> Allan will try to get a snapshot of this shortly. >>>>>>>> >>>>>>>> If this observation if correct, whats the best way to find out where >>>>>>>> its >>>>>>>> spinning? Profiling? Debug logging? 
Can you get profiling data from a >>>>>>>> JVM that doesnt exit? >>>>>>> Once I know where it is, I can look at the code and then we'll go from >>>>>>> there. >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>>> >>> >> >> >> From wilde at mcs.anl.gov Mon Jul 13 18:00:33 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Jul 2009 18:00:33 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <20090713174150.CAD74976@m4500-02.uchicago.edu> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> <50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com> <4A5BA007.2050101@mcs.anl.gov> <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com> <4A5BB276.1070902@mcs.anl.gov> <20090713174150.CAD74976@m4500-02.uchicago.edu> Message-ID: <4A5BBC91.9060005@mcs.anl.gov> OK. Best to test with Mihael's cog r2429. I hope that this ends the latest head-node havoc :) Please post either way, so we know if the other problem remains or not. Thanks, - Mike On 7/13/09 5:41 PM, skenny at uchicago.edu wrote: > cool, i'll give this a shot now too. > > it's possible the other err i mentioned to you mike, was > actually related to the stdout redirection. i wanted to test > more, but trying not to wreak havoc on the headnode :P anyway, > if this works, i can do more testing and will post if i'm > still getting the error. > > ~sk > > ---- Original message ---- >> Date: Mon, 13 Jul 2009 17:17:26 -0500 >> From: Michael Wilde >> Subject: Re: [Swift-devel] Coaster CPU-time consumption issue >> To: Allan Espinosa , Sarah Kenny > >> Cc: swift-devel >> >> Nice! Now lets beat it up and see how well it works. >> >> Sarah: Allan did not encounter the error messages you > mentioned to me. >> I suggest you do this: >> >> - post to the devel list the messages you got >> >> - test this patch to see if it clears up the problem >> >> Mike >> >> >> On 7/13/09 5:04 PM, Allan Espinosa wrote: >>> hi, >>> >>> here is a patch which solves the cpu usage on the bootstrap > coaster >>> service: > http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch >>> suggested svn log entry: >>> Added locks via wait() and notify() to prevent busy > waiting/ >>> active polling in the block task queue. >>> >>> >>> Test 2000 touch job using 066-many.swift via local:local : >>> before: http://www.ci.uchicago.edu/~aespinosa/swift/run06 >>> after: http://www.ci.uchicago.edu/~aespinosa/swift/run07 >>> >>> CPU usage drops from 100% to 0% with a few 25-40 % spikes! >>> >>> -Allan >>> >>> >>> 2009/7/13 Michael Wilde : >>>> Hi Allan, >>>> >>>> I think the methods you want for synchronization are part > of class Object. >>>> They are documented in the chapter Threads and Locks of > The Java Language >>>> Specification: >>>> >>>> > http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8 >>>> queue.wait() should be called if the queue is empty. >>>> >>>> queue.notify() or .notifyall() should be called when > something is added to >>>> the queue. I think notify() should work. >>>> >>>> .wait will I think take a timer, but suspect you dont need > that. >>>> Both should be called within the synchronized(queue) > constructs that are >>>> already in the code. >>>> >>>> Should be fun to fix this! 
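For reference, the wait()/notify() pattern described above, applied to a block task queue, looks roughly like the sketch below. This is illustrative only and is not the actual provider-coaster source; the class name, the Runnable element type and the queue field are placeholder names.

    import java.util.LinkedList;

    public class BlockTaskQueue {
        // the queue the dispatcher thread used to poll in a busy loop
        private final LinkedList<Runnable> queue = new LinkedList<Runnable>();

        // producer side: called when a new block task is submitted
        public void enqueue(Runnable task) {
            synchronized (queue) {
                queue.addLast(task);
                queue.notify();   // wake the waiting consumer instead of letting it spin
            }
        }

        // consumer side: blocks until a task is available, so no active polling
        public Runnable take() throws InterruptedException {
            synchronized (queue) {
                while (queue.isEmpty()) {   // loop guards against spurious wakeups
                    queue.wait();
                }
                return queue.removeFirst();
            }
        }
    }

With a single consumer thread, notify() (as suggested above) is sufficient; notifyAll() would only be needed if several threads wait on the same queue.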
>>>> >>>> - Mike >>>> >>>> >>>> >>>> >>>> >>>> On 7/13/09 2:12 PM, Allan Espinosa wrote: >>>>> 97% is an average as can be seen in run06. swift version > is r3005 and >>>>> cogkit r2410. this is a vanilla build of swift. >>>>> >>>>> 2009/7/13 Mihael Hategan : >>>>>> A while ago I committed a patch to run the service > process with a lower >>>>>> priority. Is that in use? >>>>>> >>>>>> Also, is logging reduced or is it the default? >>>>>> >>>>>> Is the 97% CPU usage a spike, or does it stay there on > average? >>>>>> Can I take a look at the coaster logs from skenny's run > on ranger? >>>>>> I'd also like to point out in as little offensive mode > as I can, that >>>>>> I'm working 100% on I2U2 and my lack of getting more > than lightly >>>>>> involved in this is a consequence of that. >>>>>> >>>>>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: >>>>>>> I ran 2000 "sleep 60" jobs on teraport and monitored > tp-osg. From >>>>>>> here process 22395 is the child of the main java process >>>>>>> (bootstrap.jar) and is loading the CPU. >>>>>>> >>>>>>> I have coasters.log, worker-*log, swift logs, gram logs in >>>>>>> ~aespinosa/workflows/activelog/run06. This refers to a > different run. >>>>>>> PID 15206 is the child java process of bootstrap.jar > in here. >>>>>>> top snapshot: >>>>>>> top - 13:49:03 up 55 days, 1:45, 1 user, load > average: 1.18, 0.80, >>>>>>> 0.55 >>>>>>> Tasks: 121 total, 1 running, 120 sleeping, 0 > stopped, 0 zombie >>>>>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, > 0.0%hi, 0.0%si, >>>>>>> 0.0%st >>>>>>> Mem: 4058916k total, 3889864k used, 169052k free, > 239688k buffers >>>>>>> Swap: 4192956k total, 96k used, 4192860k free, > 2504812k cached >>>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM > TIME+ COMMAND >>>>>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 > 4:29.22 java >>>>>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 > 0:00.50 top >>>>>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 > 0:00.10 >>>>>>> globus-job-mana >>>>>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 > 0:00.06 sshd >>>>>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 > 0:00.18 bash >>>>>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 > 0:00.00 bash >>>>>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 > 0:00.00 bash >>>>>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 > 0:00.20 java >>>>>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 > 0:00.09 >>>>>>> globus-job-man >>>>>>> >>>>>>> ps snapshot: >>>>>>> >>>>>>> 22328 ? S 0:00 \_ /bin/bash >>>>>>> 22364 ? Sl 0:00 \_ >>>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java >>>>>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java > -DGLOBUS_TCP_PORT_RANGE= >>>>>>> > -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >>>>>>> -DX509_CERT_DIR= > -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar >>>>>>> /tmp/bootstrap.w22332 > http://communicado.ci.uchicago.edu:46520 >>>>>>> https://128.135.125.17:46519 11505253269 >>>>>>> 22395 ? 
SNl 6:29 \_ >>>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M >>>>>>> >>>>>>> > -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >>>>>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu >>>>>>> -Djava.security.egd=file:///dev/urandom -cp >>>>>>> >>>>>>> > /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcde >> c9 > 46b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c. 
>> jar > :/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvou >> s_s > ervice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc >>>>>>> >>>>>>> 2009/7/13 Mihael Hategan : >>>>>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: >>>>>>>>>>> At the time we did not have a chance to gather > detailed evidence, >>>>>>>>>>> but I >>>>>>>>>>> was surprised by two things: >>>>>>>>>>> >>>>>>>>>>> - that there were two Java processes and that one > was so big. (Are >>>>>>>>>>> most >>>>>>>>>>> likely the active process was just a child thread > of the main >>>>>>>>>>> process?) >>>>>>>>>> One java process is the bootstrap process (it > downloads the coaster >>>>>>>>>> jars, sets up the environment and runs the coaster > service). It has >>>>>>>>>> always been like this. Did you happen to capture the > output of ps to >>>>>>>>>> a >>>>>>>>>> file? That would be useful, because from what you > are suggesting, it >>>>>>>>>> appears that the bootstrap process is eating 100% > CPU. That process >>>>>>>>>> should only be sleeping after the service is started. >>>>>>>>> I *thought* I captured the output of "top -u > sarahs'id -b -d" but I >>>>>>>>> cant >>>>>>>>> locate it. >>>>>>>>> >>>>>>>>> As best as I can recall it showed the larger > memory-footprint process >>>>>>>>> to >>>>>>>>> be relatively idle, and the smaller footprint process > (about 275MB) to >>>>>>>>> be burning 100% of a CPU. >>>>>>>> Normally, the smaller footprint process should be the > bootstrap. But >>>>>>>> that's why I would like the ps output, because it > sounds odd. >>>>>>>>> Allan will try to get a snapshot of this shortly. >>>>>>>>> >>>>>>>>> If this observation if correct, whats the best way to > find out where >>>>>>>>> its >>>>>>>>> spinning? Profiling? Debug logging? 
Can you get > profiling data from a >>>>>>>>> JVM that doesnt exit? >>>>>>>> Once I know where it is, I can look at the code and > then we'll go from >>>>>>>> there. >>>>>>>> >>>>>>>> >>>>>>>> >>>>> >>> >>> From tiberius at ci.uchicago.edu Mon Jul 13 18:21:22 2009 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Mon, 13 Jul 2009 18:21:22 -0500 Subject: [Swift-devel] Functionality request: best effort execution Message-ID: Hi Swift team I am curious if there is a way of coding up (or having in the near future) the following functionality: (file output) applicationWrapper(file input){ appOutput = runAtomicApplication(input); dummyOutput = runTimer (); if (Atomic Application Finished First){ output = appOutput; } else { output = dummyOutput; } } I am not sure how to tell swift to stop waiting for the second task, as soon as the first one has completed successfully. Thank you Tibi -- Tiberiu (Tibi) Stef-Praun, PhD Computational Sciences Researcher Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From hategan at mcs.anl.gov Mon Jul 13 18:59:24 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 18:59:24 -0500 Subject: [Swift-devel] Functionality request: best effort execution In-Reply-To: References: Message-ID: <1247529564.27051.3.camel@localhost> That somewhat crosses over the fence of time-agnostic/sequence-independent nature that swift is in. Can you implement this as part of your application (i.e. a wrapper script)? On Mon, 2009-07-13 at 18:21 -0500, Tiberiu Stef-Praun wrote: > Hi Swift team > > I am curious if there is a way of coding up (or having in the near > future) the following functionality: > > (file output) applicationWrapper(file input){ > appOutput = runAtomicApplication(input); > dummyOutput = runTimer (); > > if (Atomic Application Finished First){ > output = appOutput; > } else { > output = dummyOutput; > } > } > > > I am not sure how to tell swift to stop waiting for the second task, > as soon as the first one has completed successfully. > > Thank you > Tibi > From wilde at mcs.anl.gov Mon Jul 13 19:05:38 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Jul 2009 19:05:38 -0500 Subject: [Swift-devel] Functionality request: best effort execution In-Reply-To: <1247529564.27051.3.camel@localhost> References: <1247529564.27051.3.camel@localhost> Message-ID: <4A5BCBD2.2070600@mcs.anl.gov> On 7/13/09 6:59 PM, Mihael Hategan wrote: > That somewhat crosses over the fence of > time-agnostic/sequence-independent nature that swift is in. > > Can you implement this as part of your application (i.e. a wrapper > script)? I agree - I think the logic below could be done in a shell script fairly simply, Tibi. - Mike > > On Mon, 2009-07-13 at 18:21 -0500, Tiberiu Stef-Praun wrote: >> Hi Swift team >> >> I am curious if there is a way of coding up (or having in the near >> future) the following functionality: >> >> (file output) applicationWrapper(file input){ >> appOutput = runAtomicApplication(input); >> dummyOutput = runTimer (); >> >> if (Atomic Application Finished First){ >> output = appOutput; >> } else { >> output = dummyOutput; >> } >> } >> >> >> I am not sure how to tell swift to stop waiting for the second task, >> as soon as the first one has completed successfully. 
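For concreteness, the "take the real result if it arrives in time, otherwise fall back to a dummy result" wrapper suggested above could look roughly like the sketch below when written outside Swift. This is illustrative only: runWithFallback, runRealApp and dummyOutput are made-up names, it is shown in Java rather than as the shell script mentioned above, and it only bounds execution time once the application is actually running, not time spent waiting in a batch queue.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class BestEffortWrapper {
        // Returns the application's output if it finishes within limitSeconds,
        // otherwise cancels it and returns the supplied dummy output.
        static String runWithFallback(Callable<String> runRealApp,
                                      String dummyOutput,
                                      long limitSeconds) throws Exception {
            ExecutorService exec = Executors.newSingleThreadExecutor();
            Future<String> result = exec.submit(runRealApp);
            try {
                return result.get(limitSeconds, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                result.cancel(true);      // give up on the runaway task
                return dummyOutput;       // fall back to the dummy result
            } finally {
                exec.shutdownNow();
            }
        }
    }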
>> >> Thank you >> Tibi >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From tiberius at ci.uchicago.edu Mon Jul 13 20:22:57 2009 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Mon, 13 Jul 2009 20:22:57 -0500 Subject: [Swift-devel] Functionality request: best effort execution In-Reply-To: <4A5BCBD2.2070600@mcs.anl.gov> References: <1247529564.27051.3.camel@localhost> <4A5BCBD2.2070600@mcs.anl.gov> Message-ID: I am trying to control for runaway tasks, not just to simulate them. The scenario is for tasks which are waiting in the queue, in which case the wrapper script will not be able to implement the timeout functionality (because the tasks are not executed yet). For this reason, I wanted Swift to be aware of time-limited jobs, and give up on them without an error message (by defaulting to a "dummy" output). I am wondering if I can use globus::maxwalltime as a timeout mechanism ? My current solution is to have a task run locally and the other one remotely, and to use the local tasks' timeout as a barrier to generating the dummy output or to validating the remote result as the proper output. I know I am pushing the limits here, that's what I pretty much do all the time with Swift. Tibi On Mon, Jul 13, 2009 at 7:05 PM, Michael Wilde wrote: > > > On 7/13/09 6:59 PM, Mihael Hategan wrote: >> >> That somewhat crosses over the fence of >> time-agnostic/sequence-independent nature that swift is in. >> >> Can you implement this as part of your application (i.e. a wrapper >> script)? > > I agree - I think the logic below could be done in a shell script fairly > simply, Tibi. > > - Mike > >> >> On Mon, 2009-07-13 at 18:21 -0500, Tiberiu Stef-Praun wrote: >>> >>> Hi Swift team >>> >>> I am curious if there is a way of coding up (or having in the near >>> future) the following functionality: >>> >>> (file output) applicationWrapper(file input){ >>> ? appOutput = runAtomicApplication(input); >>> ? dummyOutput = runTimer (); >>> >>> ? if (Atomic Application Finished First){ >>> ? ? ?output = appOutput; >>> ?} else { >>> ? ? output = dummyOutput; >>> ?} >>> } >>> >>> >>> I am not sure how to tell swift to stop waiting for the second task, >>> as soon as the first one has completed successfully. >>> >>> Thank you >>> Tibi >>> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Computational Sciences Researcher Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From hategan at mcs.anl.gov Mon Jul 13 20:52:22 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 20:52:22 -0500 Subject: [Swift-devel] Functionality request: best effort execution In-Reply-To: References: <1247529564.27051.3.camel@localhost> <4A5BCBD2.2070600@mcs.anl.gov> Message-ID: <1247536342.28535.20.camel@localhost> On Mon, 2009-07-13 at 20:22 -0500, Tiberiu Stef-Praun wrote: > I am trying to control for runaway tasks, not just to simulate them. > The scenario is for tasks which are waiting in the queue, in which > case the wrapper script will not be able to implement the timeout > functionality (because the tasks are not executed yet). 
> For this reason, I wanted Swift to be aware of time-limited jobs, and > give up on them without an error message (by defaulting to a "dummy" > output). It would make your workflow nondeterministic depending on the resources you run, including possibly giving you only dummy results without as much as a single complaint. Are you sure this is what you want? In a sense, with swift, I think we're trying to eliminate this kind of nondeterministic behavior that is common in strict language concurrency, but that also means we need to restrict certain things. I can see applications to this in that there are problems that are time-sensitive (some things may only be useful if done before a certain deadline). So I'm unsure about the following: - whether this is a language issue, or something for the runtime - whether swift should support this kind of process control - what the consequences of this would be to the system in general (including but not limited to the possibility of implementing a "virtual data" thing with it and the ability to have reproducible experiments). - whether there is a middle ground, such as isolating side-effects like this (Ben would mention haskell and monads about here). > > I am wondering if I can use globus::maxwalltime as a timeout mechanism ? maxwalltime applies to the actual job (not queue times) so it's worse than a wrapper script, because as opposed to a wrapper script where you can gracefully supply a dummy result, violating maxwalltime results in an error. > My current solution is to have a task run locally and the other one > remotely, and to use the local tasks' timeout as a barrier to > generating the dummy output or to validating the remote result as the > proper output. > > I know I am pushing the limits here, that's what I pretty much do all > the time with Swift. I don't think this is a discussion about mechanisms, since for that there already is a solution in karajan called "race" (a discriminator in "workflow" terms) which (theoretically) takes care of the cleanup including canceling the branches that lost and any jobs that they might have launched. From skenny at uchicago.edu Mon Jul 13 21:54:11 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Mon, 13 Jul 2009 21:54:11 -0500 (CDT) Subject: [Swift-devel] Coasters and std's on ranger Message-ID: <20090713215411.CAD90960@m4500-02.uchicago.edu> so, here is the swift error i currently get running a 50-job workflow with the latest code on ranger: Execution failed: Exception in RInvoke: Arguments: [scripts/4reg_dummy.R, matrices/4_reg/network1/gestspeech.cov, 31, 0.5, speech] Host: RANGER Directory: 4reg_speech-20090713-2127-tbl7ou0e/jobs/f/RInvoke-f57xpmdj stderr.txt: stdout.txt: ---- Caused by: Block task failed: org.globus.gram.GramException: The job manager could not stage out a file at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531) at org.globus.gram.GramJob.setStatus(GramJob.java:184) at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) at java.lang.Thread.run(Thread.java:619) Cleaning up... Shutting down service at https://129.114.50.163:36721 Got channel MetaChannel: 24980848 -> GSSSChannel-null(1) - Done gram log shows this: 7/13 21:37:58 JM: sending callback of status 4 (failure code 155) to https://128.135.125.211:50003/1247538475621. 7/13 21:37:58 JMI: testing job manager scripts for type fork exist and permissions are ok. 
this is the same error i was getting on ranger running without coasters prior to commenting out the redirection of stdout and stderr (which corrected the error for provider-gt2). is there a redirection of these std's going on in provider-coaster that can be corrected somehow? ~sk p.s. let me know if anyone would like the swift log for this. ---- Original message ---- >Date: Mon, 13 Jul 2009 18:00:33 -0500 >From: Michael Wilde >Subject: Re: [Swift-devel] Coaster CPU-time consumption issue >To: skenny at uchicago.edu >Cc: Allan Espinosa , swift-devel > >OK. Best to test with Mihael's cog r2429. >I hope that this ends the latest head-node havoc :) > >Please post either way, so we know if the other problem remains or not. > >Thanks, > >- Mike > > >On 7/13/09 5:41 PM, skenny at uchicago.edu wrote: >> cool, i'll give this a shot now too. >> >> it's possible the other err i mentioned to you mike, was >> actually related to the stdout redirection. i wanted to test >> more, but trying not to wreak havoc on the headnode :P anyway, >> if this works, i can do more testing and will post if i'm >> still getting the error. >> >> ~sk >> >> ---- Original message ---- >>> Date: Mon, 13 Jul 2009 17:17:26 -0500 >>> From: Michael Wilde >>> Subject: Re: [Swift-devel] Coaster CPU-time consumption issue >>> To: Allan Espinosa , Sarah Kenny >> >>> Cc: swift-devel >>> >>> Nice! Now lets beat it up and see how well it works. >>> >>> Sarah: Allan did not encounter the error messages you >> mentioned to me. >>> I suggest you do this: >>> >>> - post to the devel list the messages you got >>> >>> - test this patch to see if it clears up the problem >>> >>> Mike >>> >>> >>> On 7/13/09 5:04 PM, Allan Espinosa wrote: >>>> hi, >>>> >>>> here is a patch which solves the cpu usage on the bootstrap >> coaster >>>> service: >> http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch >>>> suggested svn log entry: >>>> Added locks via wait() and notify() to prevent busy >> waiting/ >>>> active polling in the block task queue. >>>> >>>> >>>> Test 2000 touch job using 066-many.swift via local:local : >>>> before: http://www.ci.uchicago.edu/~aespinosa/swift/run06 >>>> after: http://www.ci.uchicago.edu/~aespinosa/swift/run07 >>>> >>>> CPU usage drops from 100% to 0% with a few 25-40 % spikes! >>>> >>>> -Allan >>>> >>>> >>>> 2009/7/13 Michael Wilde : >>>>> Hi Allan, >>>>> >>>>> I think the methods you want for synchronization are part >> of class Object. >>>>> They are documented in the chapter Threads and Locks of >> The Java Language >>>>> Specification: >>>>> >>>>> >> http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8 >>>>> queue.wait() should be called if the queue is empty. >>>>> >>>>> queue.notify() or .notifyall() should be called when >> something is added to >>>>> the queue. I think notify() should work. >>>>> >>>>> .wait will I think take a timer, but suspect you dont need >> that. >>>>> Both should be called within the synchronized(queue) >> constructs that are >>>>> already in the code. >>>>> >>>>> Should be fun to fix this! >>>>> >>>>> - Mike >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 7/13/09 2:12 PM, Allan Espinosa wrote: >>>>>> 97% is an average as can be seen in run06. swift version >> is r3005 and >>>>>> cogkit r2410. this is a vanilla build of swift. >>>>>> >>>>>> 2009/7/13 Mihael Hategan : >>>>>>> A while ago I committed a patch to run the service >> process with a lower >>>>>>> priority. Is that in use? >>>>>>> >>>>>>> Also, is logging reduced or is it the default? 
>>>>>>> >>>>>>> Is the 97% CPU usage a spike, or does it stay there on >> average? >>>>>>> Can I take a look at the coaster logs from skenny's run >> on ranger? >>>>>>> I'd also like to point out in as little offensive mode >> as I can, that >>>>>>> I'm working 100% on I2U2 and my lack of getting more >> than lightly >>>>>>> involved in this is a consequence of that. >>>>>>> >>>>>>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: >>>>>>>> I ran 2000 "sleep 60" jobs on teraport and monitored >> tp-osg. From >>>>>>>> here process 22395 is the child of the main java process >>>>>>>> (bootstrap.jar) and is loading the CPU. >>>>>>>> >>>>>>>> I have coasters.log, worker-*log, swift logs, gram logs in >>>>>>>> ~aespinosa/workflows/activelog/run06. This refers to a >> different run. >>>>>>>> PID 15206 is the child java process of bootstrap.jar >> in here. >>>>>>>> top snapshot: >>>>>>>> top - 13:49:03 up 55 days, 1:45, 1 user, load >> average: 1.18, 0.80, >>>>>>>> 0.55 >>>>>>>> Tasks: 121 total, 1 running, 120 sleeping, 0 >> stopped, 0 zombie >>>>>>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, >> 0.0%hi, 0.0%si, >>>>>>>> 0.0%st >>>>>>>> Mem: 4058916k total, 3889864k used, 169052k free, >> 239688k buffers >>>>>>>> Swap: 4192956k total, 96k used, 4192860k free, >> 2504812k cached >>>>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM >> TIME+ COMMAND >>>>>>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 >> 4:29.22 java >>>>>>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 >> 0:00.50 top >>>>>>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 >> 0:00.10 >>>>>>>> globus-job-mana >>>>>>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 >> 0:00.06 sshd >>>>>>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 >> 0:00.18 bash >>>>>>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 >> 0:00.00 bash >>>>>>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 >> 0:00.00 bash >>>>>>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 >> 0:00.20 java >>>>>>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 >> 0:00.09 >>>>>>>> globus-job-man >>>>>>>> >>>>>>>> ps snapshot: >>>>>>>> >>>>>>>> 22328 ? S 0:00 \_ /bin/bash >>>>>>>> 22364 ? Sl 0:00 \_ >>>>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java >>>>>>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java >> -DGLOBUS_TCP_PORT_RANGE= >>>>>>>> >> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >>>>>>>> -DX509_CERT_DIR= >> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar >>>>>>>> /tmp/bootstrap.w22332 >> http://communicado.ci.uchicago.edu:46520 >>>>>>>> https://128.135.125.17:46519 11505253269 >>>>>>>> 22395 ? 
SNl 6:29 \_ >>>>>>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M >>>>>>>> >>>>>>>> >> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up >>>>>>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu >>>>>>>> -Djava.security.egd=file:///dev/urandom -cp >>>>>>>> >>>>>>>> >> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcde >>> c9 >> 46b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c. 
>>> jar >> :/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvou >>> s_s >> ervice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc >>>>>>>> >>>>>>>> 2009/7/13 Mihael Hategan : >>>>>>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: >>>>>>>>>>>> At the time we did not have a chance to gather >> detailed evidence, >>>>>>>>>>>> but I >>>>>>>>>>>> was surprised by two things: >>>>>>>>>>>> >>>>>>>>>>>> - that there were two Java processes and that one >> was so big. (Are >>>>>>>>>>>> most >>>>>>>>>>>> likely the active process was just a child thread >> of the main >>>>>>>>>>>> process?) >>>>>>>>>>> One java process is the bootstrap process (it >> downloads the coaster >>>>>>>>>>> jars, sets up the environment and runs the coaster >> service). It has >>>>>>>>>>> always been like this. Did you happen to capture the >> output of ps to >>>>>>>>>>> a >>>>>>>>>>> file? That would be useful, because from what you >> are suggesting, it >>>>>>>>>>> appears that the bootstrap process is eating 100% >> CPU. That process >>>>>>>>>>> should only be sleeping after the service is started. >>>>>>>>>> I *thought* I captured the output of "top -u >> sarahs'id -b -d" but I >>>>>>>>>> cant >>>>>>>>>> locate it. >>>>>>>>>> >>>>>>>>>> As best as I can recall it showed the larger >> memory-footprint process >>>>>>>>>> to >>>>>>>>>> be relatively idle, and the smaller footprint process >> (about 275MB) to >>>>>>>>>> be burning 100% of a CPU. >>>>>>>>> Normally, the smaller footprint process should be the >> bootstrap. But >>>>>>>>> that's why I would like the ps output, because it >> sounds odd. >>>>>>>>>> Allan will try to get a snapshot of this shortly. 
>>>>>>>>>> >>>>>>>>>> If this observation if correct, whats the best way to >> find out where >>>>>>>>>> its >>>>>>>>>> spinning? Profiling? Debug logging? Can you get >> profiling data from a >>>>>>>>>> JVM that doesnt exit? >>>>>>>>> Once I know where it is, I can look at the code and >> then we'll go from >>>>>>>>> there. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>> >>>> >>>> From hategan at mcs.anl.gov Mon Jul 13 22:05:51 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 22:05:51 -0500 Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <20090713215411.CAD90960@m4500-02.uchicago.edu> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> Message-ID: <1247540751.30172.5.camel@localhost> On Mon, 2009-07-13 at 21:54 -0500, skenny at uchicago.edu wrote: [...] > gram log shows this: > > 7/13 21:37:58 JM: sending callback of status 4 (failure code > 155) to https://128.135.125.211:50003/1247538475621. > 7/13 21:37:58 JMI: testing job manager scripts for type fork > exist and permissions are ok. > > this is the same error i was getting on ranger running without > coasters prior to commenting out the redirection of stdout and > stderr (which corrected the error for provider-gt2). I am afraid then that this is an incurable problem with the current SGE job manager. I think there are two ways of dealing with this: 1. Report the problem to the folks who developed the SGE job manager and hope it will get fixed and deployed on ranger 2. Write a local SGE provider [/me ducks while Ian throws various objects in my general direction] From hategan at mcs.anl.gov Mon Jul 13 22:18:58 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Jul 2009 22:18:58 -0500 Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <1247540751.30172.5.camel@localhost> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> Message-ID: <1247541538.30172.8.camel@localhost> On Mon, 2009-07-13 at 22:05 -0500, Mihael Hategan wrote: > On Mon, 2009-07-13 at 21:54 -0500, skenny at uchicago.edu wrote: > [...] > > gram log shows this: > > > > 7/13 21:37:58 JM: sending callback of status 4 (failure code > > 155) to https://128.135.125.211:50003/1247538475621. > > 7/13 21:37:58 JMI: testing job manager scripts for type fork > > exist and permissions are ok. > > > > this is the same error i was getting on ranger running without > > coasters prior to commenting out the redirection of stdout and > > stderr (which corrected the error for provider-gt2). > > I am afraid then that this is an incurable problem with the current SGE > job manager. Or not... I see that in the current coaster code the stdout of the block task is always redirected. Try cog r2430 and keep the commented lines commented in the gt2 provider. From skenny at uchicago.edu Tue Jul 14 02:05:26 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Tue, 14 Jul 2009 02:05:26 -0500 (CDT) Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <1247541538.30172.8.camel@localhost> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> Message-ID: <20090714020526.CAE04557@m4500-02.uchicago.edu> >I see that in the current coaster code the stdout of the block task is >always redirected. > >Try cog r2430 and keep the commented lines commented in the gt2 >provider. 
2009-07-14 01:14:30,525-0500 INFO unknown Swift svn swift-r3005 cog-r2430 (cog modified locally) Execution failed: Exception in RInvoke: Arguments: [scripts/4reg_dummy.R, matrices/4_reg/network1/gestspeech.cov, 2, 0.5, speech] Host: RANGER Directory: 4reg_speech-20090714-0114-ad0vxv90/jobs/z/RInvoke-zlbzzmdj stderr.txt: stdout.txt: ---- Caused by: Block task failed: 0714-140152-000000Block task ended prematurely Progress: Submitted:18 Failed:16 Finished successfully:16 Cleaning up... gram log: 7/14 01:25:44 JM: sending callback of status 4 (failure code 155) to https://128.135.125.211:50003/1247552072425. 7/14 01:25:44 JMI: testing job manager scripts for type fork exist and permissions are ok. From benc at hawaga.org.uk Tue Jul 14 02:09:16 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 14 Jul 2009 07:09:16 +0000 (GMT) Subject: [Swift-devel] Functionality request: best effort execution In-Reply-To: References: Message-ID: One way of putting in ambiguity here is something like the AMB(iguous) operator, which looks very similar to Karajan's race behaviour. a AMB b evaluates to either a or b but its not defined which and so the runtime can pick which. That has no particular preference for a result, though in Tibi's use case one of the results is probably preferred. You could change the semantics so that it returns a unless a fails in which case it evaluates and returns b, unless b fails in which case the expression fails to evaluate. Both of the above descriptions can be extended to more than two operands in a natural way. -- From bugzilla-daemon at mcs.anl.gov Tue Jul 14 06:29:00 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 14 Jul 2009 06:29:00 -0500 (CDT) Subject: [Swift-devel] [Bug 210] job exceeding wallclock limit -- error is not reported by swift In-Reply-To: References: Message-ID: <20090714112900.40D302CB0F@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=210 --- Comment #1 from Ben Clifford 2009-07-14 06:29:00 --- This bug is rather ambiguously described. In non-bugzilla discussion it has been reported as: > well, for some reason, when a job hits wallclock and is killed by the JM, swift just keeps saying "active" This is not behaviour that I observe with Swift against NCSA using the below swiftscript and configuration using Swift swift-r3006 cog-r2430 - in such case, I see the job fail three times in a row and then the example SwiftScript fails as should happen. Please clarify this bug. s.swift: $ cat s.swift type messagefile; app (messagefile t) greeting() { sleep "999s" stdout=@filename(t); } messagefile outfile <"hello.txt">; outfile = greeting(); tc.data: $ cat tc.data cat: tc.data: No such file or directory benc at communicado:~/tmp-walltime/cog/modules/swift !1055 $ cat dist/swift-svn/etc/tc.data #This is the transformation catalog. # #It comes pre-configured with a number of simple transformations with #paths that are likely to work on a linux box. However, on some systems, #the paths to these executables will be different (for example, sometimes #some of these programs are found in /usr/bin rather than in /bin) # #NOTE WELL: fields in this file must be separated by tabs, not spaces; and #there must be no trailing whitespace at the end of each line. 
# # sitename transformation path INSTALLED platform profiles hg echo /bin/echo INSTALLED INTEL32::LINUX null hg cat /bin/cat INSTALLED INTEL32::LINUX null hg ls /bin/ls INSTALLED INTEL32::LINUX null hg grep /bin/grep INSTALLED INTEL32::LINUX null hg sort /bin/sort INSTALLED INTEL32::LINUX null hg sleep /bin/sleep INSTALLED INTEL32::LINUX null site definition: /home/ac/benc debug 1 the output: Swift svn swift-r3006 cog-r2430 RunID: 20090714-0616-dgktv8b3 Progress: Progress: Stage in:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Active:1 Progress: Active:1 Progress: Active:1 Progress: Active:1 Progress: Checking status:1 Progress: Stage in:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Active:1 Progress: Active:1 Progress: Active:1 Progress: Checking status:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Active:1 Progress: Active:1 Progress: Active:1 Progress: Checking status:1 Execution failed: Exception in sleep: Arguments: [999s] Host: hg Directory: s-20090714-0616-dgktv8b3/jobs/8/sleep-8h82cndj stderr.txt: stdout.txt: ---- Caused by: No status file was found. Check the shared filesystem on hg -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching someone on the CC list of the bug. From wilde at mcs.anl.gov Tue Jul 14 08:30:22 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 14 Jul 2009 08:30:22 -0500 Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <20090714020526.CAE04557@m4500-02.uchicago.edu> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> Message-ID: <4A5C886E.8070300@mcs.anl.gov> Sarah, if its still broken on Thu, I will look at it then. I assume it happens on single job runs as well. Can you create a simple 1-job test directory on Ranger that I can copy to reproduce the problem? Ben, if you can solve this, this week, that would be great. Else Allan and I will look at it; guidance welcome. Thanks, Mike On 7/14/09 2:05 AM, skenny at uchicago.edu wrote: >> I see that in the current coaster code the stdout of the > block task is >> always redirected. >> >> Try cog r2430 and keep the commented lines commented in the gt2 >> provider. > > 2009-07-14 01:14:30,525-0500 INFO unknown Swift svn > swift-r3005 cog-r2430 (cog modified locally) > > Execution failed: > Exception in RInvoke: > Arguments: [scripts/4reg_dummy.R, > matrices/4_reg/network1/gestspeech.cov, 2, 0.5, speech] > Host: RANGER > Directory: > 4reg_speech-20090714-0114-ad0vxv90/jobs/z/RInvoke-zlbzzmdj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Block task failed: 0714-140152-000000Block task ended > prematurely > > Progress: Submitted:18 Failed:16 Finished successfully:16 > Cleaning up... > > gram log: > > 7/14 01:25:44 JM: sending callback of status 4 (failure code > 155) to https://128.135.125.211:50003/1247552072425. > 7/14 01:25:44 JMI: testing job manager scripts for type fork > exist and permissions are ok. 
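A standalone GRAM2 test along these lines may help reproduce the problem with Swift out of the picture; the gatekeeper contact string is a placeholder for the Ranger gatekeeper and the RSL uses only standard GRAM attributes. Submitting the same trivial job once with no redirection and once with stdout/stderr pointed at files in the remote home directory should show which case the SGE job manager actually mishandles:

$ globusrun -r gatekeeper.example.org/jobmanager-sge \
    '&(executable=/bin/hostname)(count=1)'

$ globusrun -r gatekeeper.example.org/jobmanager-sge \
    '&(executable=/bin/hostname)(count=1)(stdout=gram-sge-test.out)(stderr=gram-sge-test.err)'

If only the redirected form produces the same "failure code 155" callback in the gram log, that points at the job manager's handling of redirection rather than at anything coasters adds on top.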
> > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Jul 14 09:59:01 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 14 Jul 2009 09:59:01 -0500 Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <20090714020526.CAE04557@m4500-02.uchicago.edu> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> Message-ID: <1247583541.1437.1.camel@localhost> I see. What happens is that redirection hasn't been fixed in SGE, but the commenting out of it in the gt2 provider did nothing because it was enabled in the coaster provider. There is one more thing to try, and that is to re-direct to a remote file, hoping it won't hit whatever problem it hits now. On Tue, 2009-07-14 at 02:05 -0500, skenny at uchicago.edu wrote: > >I see that in the current coaster code the stdout of the > block task is > >always redirected. > > > >Try cog r2430 and keep the commented lines commented in the gt2 > >provider. > > 2009-07-14 01:14:30,525-0500 INFO unknown Swift svn > swift-r3005 cog-r2430 (cog modified locally) > > Execution failed: > Exception in RInvoke: > Arguments: [scripts/4reg_dummy.R, > matrices/4_reg/network1/gestspeech.cov, 2, 0.5, speech] > Host: RANGER > Directory: > 4reg_speech-20090714-0114-ad0vxv90/jobs/z/RInvoke-zlbzzmdj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Block task failed: 0714-140152-000000Block task ended > prematurely > > Progress: Submitted:18 Failed:16 Finished successfully:16 > Cleaning up... > > gram log: > > 7/14 01:25:44 JM: sending callback of status 4 (failure code > 155) to https://128.135.125.211:50003/1247552072425. > 7/14 01:25:44 JMI: testing job manager scripts for type fork > exist and permissions are ok. > > > > > From skenny at uchicago.edu Tue Jul 14 10:11:40 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Tue, 14 Jul 2009 10:11:40 -0500 (CDT) Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <1247583541.1437.1.camel@localhost> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> <1247583541.1437.1.camel@localhost> Message-ID: <20090714101140.CAE36286@m4500-02.uchicago.edu> >I see. What happens is that redirection hasn't been fixed in SGE, but >the commenting out of it in the gt2 provider did nothing because it was >enabled in the coaster provider. right, i must've misunderstood, i had commented out redirection for the gt2 provider so swift would work for running w/o coasters, but i thought you were saying cog r2430 would be also redirecting for coasters as well...but apparently you were trying a different change? >There is one more thing to try, and that is to re-direct to a remote >file, hoping it won't hit whatever problem it hits now. so, can you tell me where in the code i can redirect std's for coasters? or, are you saying something else? :P >On Tue, 2009-07-14 at 02:05 -0500, skenny at uchicago.edu wrote: >> >I see that in the current coaster code the stdout of the >> block task is >> >always redirected. >> > >> >Try cog r2430 and keep the commented lines commented in the gt2 >> >provider. 
>> >> 2009-07-14 01:14:30,525-0500 INFO unknown Swift svn >> swift-r3005 cog-r2430 (cog modified locally) >> >> Execution failed: >> Exception in RInvoke: >> Arguments: [scripts/4reg_dummy.R, >> matrices/4_reg/network1/gestspeech.cov, 2, 0.5, speech] >> Host: RANGER >> Directory: >> 4reg_speech-20090714-0114-ad0vxv90/jobs/z/RInvoke-zlbzzmdj >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Block task failed: 0714-140152-000000Block task ended >> prematurely >> >> Progress: Submitted:18 Failed:16 Finished successfully:16 >> Cleaning up... >> >> gram log: >> >> 7/14 01:25:44 JM: sending callback of status 4 (failure code >> 155) to https://128.135.125.211:50003/1247552072425. >> 7/14 01:25:44 JMI: testing job manager scripts for type fork >> exist and permissions are ok. >> >> >> >> >> > From hategan at mcs.anl.gov Tue Jul 14 10:22:26 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 14 Jul 2009 10:22:26 -0500 Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <20090714101140.CAE36286@m4500-02.uchicago.edu> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> <1247583541.1437.1.camel@localhost> <20090714101140.CAE36286@m4500-02.uchicago.edu> Message-ID: <1247584946.1759.8.camel@localhost> On Tue, 2009-07-14 at 10:11 -0500, skenny at uchicago.edu wrote: > >I see. What happens is that redirection hasn't been fixed in > SGE, but > >the commenting out of it in the gt2 provider did nothing > because it was > >enabled in the coaster provider. > > right, i must've misunderstood, i had commented out > redirection for the gt2 provider so swift would work for > running w/o coasters, but i thought you were saying cog r2430 > would be also redirecting for coasters as well...but > apparently you were trying a different change? No. As I was mentioning the commenting out I forgot that the output is redirected anyway by the coaster code. So our little experiment then did nothing. Cog r2430 removed the explicit redirection in the coaster code. Without that and without the hack to always redirect for SGE in the gt2 provider that you commented out, there was no more redirection, so the SGE job manager bug surfaced. In cog r2431, there's redirection to a file. Do keep the lines commented in the gt2 provider. I'm not sure how that will work out, but please try and let me know. From skenny at uchicago.edu Tue Jul 14 12:38:36 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Tue, 14 Jul 2009 12:38:36 -0500 (CDT) Subject: [Swift-devel] [Bug 210] job exceeding wallclock limit -- error is not reported by swift In-Reply-To: <20090714112900.40D302CB0F@wind.mcs.anl.gov> References: <20090714112900.40D302CB0F@wind.mcs.anl.gov> Message-ID: <20090714123836.CAE58724@m4500-02.uchicago.edu> can you try resubmitting your test to ranger? ---- Original message ---- >Date: Tue, 14 Jul 2009 06:29:00 -0500 (CDT) >From: bugzilla-daemon at mcs.anl.gov >Subject: [Swift-devel] [Bug 210] job exceeding wallclock limit -- error is not reported by swift >To: swift-devel at ci.uchicago.edu > >https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=210 > > > > > >--- Comment #1 from Ben Clifford 2009-07-14 06:29:00 --- >This bug is rather ambiguously described. 
> >In non-bugzilla discussion it has been reported as: > >> well, for some reason, when a job hits wallclock and is killed by the JM, swift just keeps saying "active" > >This is not behaviour that I observe with Swift against NCSA using the below >swiftscript and configuration using Swift swift-r3006 cog-r2430 - in such case, >I see the job fail three times in a row and then the example SwiftScript fails >as should happen. > >Please clarify this bug. > >s.swift: > >$ cat s.swift >type messagefile; > >app (messagefile t) greeting() { > sleep "999s" stdout=@filename(t); >} > >messagefile outfile <"hello.txt">; > >outfile = greeting(); > > > >tc.data: > >$ cat tc.data >cat: tc.data: No such file or directory >benc at communicado:~/tmp-walltime/cog/modules/swift !1055 >$ cat dist/swift-svn/etc/tc.data >#This is the transformation catalog. ># >#It comes pre-configured with a number of simple transformations with >#paths that are likely to work on a linux box. However, on some systems, >#the paths to these executables will be different (for example, sometimes >#some of these programs are found in /usr/bin rather than in /bin) ># >#NOTE WELL: fields in this file must be separated by tabs, not spaces; and >#there must be no trailing whitespace at the end of each line. ># ># sitename transformation path INSTALLED platform profiles >hg echo /bin/echo INSTALLED INTEL32::LINUX null >hg cat /bin/cat INSTALLED INTEL32::LINUX null >hg ls /bin/ls INSTALLED INTEL32::LINUX null >hg grep /bin/grep INSTALLED INTEL32::LINUX null >hg sort /bin/sort INSTALLED INTEL32::LINUX null >hg sleep /bin/sleep INSTALLED INTEL32::LINUX null > > >site definition: > > > > url="grid-hg.ncsa.teragrid.org/jobmanager-pbs >" major="2" /> > /home/ac/benc > debug > 1 > > > >the output: > >Swift svn swift-r3006 cog-r2430 > >RunID: 20090714-0616-dgktv8b3 >Progress: >Progress: Stage in:1 >Progress: Submitted:1 >Progress: Submitted:1 >Progress: Submitted:1 >Progress: Active:1 >Progress: Active:1 >Progress: Active:1 >Progress: Active:1 >Progress: Checking status:1 >Progress: Stage in:1 >Progress: Submitted:1 >Progress: Submitted:1 >Progress: Active:1 >Progress: Active:1 >Progress: Active:1 >Progress: Checking status:1 >Progress: Submitted:1 >Progress: Submitted:1 >Progress: Submitted:1 >Progress: Active:1 >Progress: Active:1 >Progress: Active:1 >Progress: Checking status:1 >Execution failed: > Exception in sleep: >Arguments: [999s] >Host: hg >Directory: s-20090714-0616-dgktv8b3/jobs/8/sleep-8h82cndj >stderr.txt: >stdout.txt: >---- > >Caused by: > No status file was found. Check the shared filesystem on hg > >-- >Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email >------- You are receiving this mail because: ------- >You are watching the assignee of the bug. >You are watching someone on the CC list of the bug. 
>_______________________________________________ >Swift-devel mailing list >Swift-devel at ci.uchicago.edu >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From skenny at uchicago.edu Tue Jul 14 13:22:26 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Tue, 14 Jul 2009 13:22:26 -0500 (CDT) Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <1247584946.1759.8.camel@localhost> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> <1247583541.1437.1.camel@localhost> <20090714101140.CAE36286@m4500-02.uchicago.edu> <1247584946.1759.8.camel@localhost> Message-ID: <20090714132226.CAE63545@m4500-02.uchicago.edu> darn... Execution failed: Exception in RInvoke: Arguments: [scripts/4reg_dummy.R, matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech] Host: RANGER Directory: 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj stderr.txt: stdout.txt: ---- Caused by: Block task failed: 0714-090151-000000Block task ended prematurely Cleaning up... Shutting down service at https://129.114.50.163:38571 i can file a bug report with TG if need be, but i'm not quite sure the best thing to tell them (?) also, i'm wondering how coasters was previously able to work around this bug? ~sk ---- Original message ---- >Date: Tue, 14 Jul 2009 10:22:26 -0500 >From: Mihael Hategan >Subject: Re: [Swift-devel] Coasters and std's on ranger >To: skenny at uchicago.edu >Cc: swift-devel > >On Tue, 2009-07-14 at 10:11 -0500, skenny at uchicago.edu wrote: >> >I see. What happens is that redirection hasn't been fixed in >> SGE, but >> >the commenting out of it in the gt2 provider did nothing >> because it was >> >enabled in the coaster provider. >> >> right, i must've misunderstood, i had commented out >> redirection for the gt2 provider so swift would work for >> running w/o coasters, but i thought you were saying cog r2430 >> would be also redirecting for coasters as well...but >> apparently you were trying a different change? > >No. As I was mentioning the commenting out I forgot that the output is >redirected anyway by the coaster code. So our little experiment then did >nothing. > >Cog r2430 removed the explicit redirection in the coaster code. Without >that and without the hack to always redirect for SGE in the gt2 provider >that you commented out, there was no more redirection, so the SGE job >manager bug surfaced. > >In cog r2431, there's redirection to a file. Do keep the lines commented >in the gt2 provider. I'm not sure how that will work out, but please try >and let me know. > From hategan at mcs.anl.gov Tue Jul 14 14:21:57 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 14 Jul 2009 14:21:57 -0500 Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <20090714132226.CAE63545@m4500-02.uchicago.edu> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> <1247583541.1437.1.camel@localhost> <20090714101140.CAE36286@m4500-02.uchicago.edu> <1247584946.1759.8.camel@localhost> <20090714132226.CAE63545@m4500-02.uchicago.edu> Message-ID: <1247599317.7032.0.camel@localhost> On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote: > darn... 
> > Execution failed: > Exception in RInvoke: > Arguments: [scripts/4reg_dummy.R, > matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech] > Host: RANGER > Directory: > 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Block task failed: 0714-090151-000000Block task ended > prematurely > > Cleaning up... > Shutting down service at https://129.114.50.163:38571 > > i can file a bug report with TG if need be, but i'm not quite > sure the best thing to tell them (?) also, i'm wondering how > coasters was previously able to work around this bug? By redirecting stdout+stderr to memory, but that causes the "job manager could not stage out a file" problem. From wilde at mcs.anl.gov Tue Jul 14 14:29:00 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 14 Jul 2009 14:29:00 -0500 Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <1247599317.7032.0.camel@localhost> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> <1247583541.1437.1.camel@localhost> <20090714101140.CAE36286@m4500-02.uchicago.edu> <1247584946.1759.8.camel@localhost> <20090714132226.CAE63545@m4500-02.uchicago.edu> <1247599317.7032.0.camel@localhost> Message-ID: <4A5CDC7C.6070304@mcs.anl.gov> Will the current code work for swift programs that dont use stdout or stderr? (Ie where the app wrappers redirect these to a file?) - Mike On 7/14/09 2:21 PM, Mihael Hategan wrote: > On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote: >> darn... >> >> Execution failed: >> Exception in RInvoke: >> Arguments: [scripts/4reg_dummy.R, >> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech] >> Host: RANGER >> Directory: >> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Block task failed: 0714-090151-000000Block task ended >> prematurely >> >> Cleaning up... >> Shutting down service at https://129.114.50.163:38571 >> >> i can file a bug report with TG if need be, but i'm not quite >> sure the best thing to tell them (?) also, i'm wondering how >> coasters was previously able to work around this bug? > > By redirecting stdout+stderr to memory, but that causes the "job manager > could not stage out a file" problem. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Jul 14 14:33:33 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 14 Jul 2009 14:33:33 -0500 Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <4A5CDC7C.6070304@mcs.anl.gov> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> <1247583541.1437.1.camel@localhost> <20090714101140.CAE36286@m4500-02.uchicago.edu> <1247584946.1759.8.camel@localhost> <20090714132226.CAE63545@m4500-02.uchicago.edu> <1247599317.7032.0.camel@localhost> <4A5CDC7C.6070304@mcs.anl.gov> Message-ID: <1247600013.7032.13.camel@localhost> This isn't the app stdout/stderr, but the job stdout/stderr. They are redirected in coasters for debugging/accounting purposes, and with SGE because the [censored] thing doesn't work otherwise. 
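The distinction, sketched with a throwaway command standing in for the real application (this is not the actual _swiftwrap logic, only an illustration): the per-app wrapper catches the application's streams in files inside the job directory, while the block task -- the SGE job that hosts the coaster workers -- has its own stdout/stderr, and that second stream is the one the coaster service redirects (to memory before, to a file as of cog r2431).

# what the app-side wrapper arranges for the *application's* output (illustrative only)
$ /bin/echo "Hello, world!" 1> stdout.txt 2> stderr.txt

# the *block task's* own streams are separate; they are named via the
# (stdout=...)/(stderr=...) attributes in the RSL of the SGE job itself, and it
# is that redirection the SGE job manager trips over.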
On Tue, 2009-07-14 at 14:29 -0500, Michael Wilde wrote: > Will the current code work for swift programs that dont use stdout or > stderr? (Ie where the app wrappers redirect these to a file?) > > - Mike > > On 7/14/09 2:21 PM, Mihael Hategan wrote: > > On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote: > >> darn... > >> > >> Execution failed: > >> Exception in RInvoke: > >> Arguments: [scripts/4reg_dummy.R, > >> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech] > >> Host: RANGER > >> Directory: > >> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj > >> stderr.txt: > >> > >> stdout.txt: > >> > >> ---- > >> > >> Caused by: > >> Block task failed: 0714-090151-000000Block task ended > >> prematurely > >> > >> Cleaning up... > >> Shutting down service at https://129.114.50.163:38571 > >> > >> i can file a bug report with TG if need be, but i'm not quite > >> sure the best thing to tell them (?) also, i'm wondering how > >> coasters was previously able to work around this bug? > > > > By redirecting stdout+stderr to memory, but that causes the "job manager > > could not stage out a file" problem. > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Jul 14 14:46:17 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 14 Jul 2009 14:46:17 -0500 Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <1247600013.7032.13.camel@localhost> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> <1247583541.1437.1.camel@localhost> <20090714101140.CAE36286@m4500-02.uchicago.edu> <1247584946.1759.8.camel@localhost> <20090714132226.CAE63545@m4500-02.uchicago.edu> <1247599317.7032.0.camel@localhost> <4A5CDC7C.6070304@mcs.anl.gov> <1247600013.7032.13.camel@localhost> Message-ID: <4A5CE089.40603@mcs.anl.gov> So are any of the following reasonable ways to proceed? 1) Develop an SGE provider (hopefully heavily based on the PBS provider) and run on Ranger locally. 2) Debug getting Coasters, GRAM and SGE to coexist nicely (ie the debugging route in progress now) 3) Start the coaster service manually in one block allocation and have it rendezvous with Swift For (2) can we create a GRAM test job outside of Swift that we can debug, to try to find a set of GRAM options that work? I need to read the thread more carefully, but I dont understand if the problem is in Ranger SGE, the GRAM SGE jobmanager, or the interaction between them. I'll re-read the thread first before asking for more clarification; I didnt get it on first read. - Mike On 7/14/09 2:33 PM, Mihael Hategan wrote: > This isn't the app stdout/stderr, but the job stdout/stderr. They are > redirected in coasters for debugging/accounting purposes, and with SGE > because the [censored] thing doesn't work otherwise. > > On Tue, 2009-07-14 at 14:29 -0500, Michael Wilde wrote: >> Will the current code work for swift programs that dont use stdout or >> stderr? (Ie where the app wrappers redirect these to a file?) >> >> - Mike >> >> On 7/14/09 2:21 PM, Mihael Hategan wrote: >>> On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote: >>>> darn... 
>>>> >>>> Execution failed: >>>> Exception in RInvoke: >>>> Arguments: [scripts/4reg_dummy.R, >>>> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech] >>>> Host: RANGER >>>> Directory: >>>> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj >>>> stderr.txt: >>>> >>>> stdout.txt: >>>> >>>> ---- >>>> >>>> Caused by: >>>> Block task failed: 0714-090151-000000Block task ended >>>> prematurely >>>> >>>> Cleaning up... >>>> Shutting down service at https://129.114.50.163:38571 >>>> >>>> i can file a bug report with TG if need be, but i'm not quite >>>> sure the best thing to tell them (?) also, i'm wondering how >>>> coasters was previously able to work around this bug? >>> By redirecting stdout+stderr to memory, but that causes the "job manager >>> could not stage out a file" problem. >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Tue Jul 14 14:53:05 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 14 Jul 2009 14:53:05 -0500 Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <4A5CE089.40603@mcs.anl.gov> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> <1247583541.1437.1.camel@localhost> <20090714101140.CAE36286@m4500-02.uchicago.edu> <1247584946.1759.8.camel@localhost> <20090714132226.CAE63545@m4500-02.uchicago.edu> <1247599317.7032.0.camel@localhost> <4A5CDC7C.6070304@mcs.anl.gov> <1247600013.7032.13.camel@localhost> <4A5CE089.40603@mcs.anl.gov> Message-ID: <1247601185.7638.3.camel@localhost> On Tue, 2009-07-14 at 14:46 -0500, Michael Wilde wrote: > So are any of the following reasonable ways to proceed? > > 1) Develop an SGE provider (hopefully heavily based on the PBS provider) > and run on Ranger locally. > > 2) Debug getting Coasters, GRAM and SGE to coexist nicely (ie the > debugging route in progress now) Yeah. I mentioned those two yesterday. > > 3) Start the coaster service manually in one block allocation and have > it rendezvous with Swift Possible. You could also force the current one to allocate a single block or even ignore the stageout error because it occurs after a block is done. > > For (2) can we create a GRAM test job outside of Swift that we can > debug, to try to find a set of GRAM options that work? You're welcome to try. I haven't been able to do it so far. From skenny at uchicago.edu Tue Jul 14 16:57:40 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Tue, 14 Jul 2009 16:57:40 -0500 (CDT) Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <1247599317.7032.0.camel@localhost> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> <1247583541.1437.1.camel@localhost> <20090714101140.CAE36286@m4500-02.uchicago.edu> <1247584946.1759.8.camel@localhost> <20090714132226.CAE63545@m4500-02.uchicago.edu> <1247599317.7032.0.camel@localhost> Message-ID: <20090714165740.CAE89638@m4500-02.uchicago.edu> ---- Original message ---- >Date: Tue, 14 Jul 2009 14:21:57 -0500 >From: Mihael Hategan >Subject: Re: [Swift-devel] Coasters and std's on ranger >To: skenny at uchicago.edu >Cc: swift-devel > >On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote: >> darn... 
>> >> Execution failed: >> Exception in RInvoke: >> Arguments: [scripts/4reg_dummy.R, >> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech] >> Host: RANGER >> Directory: >> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Block task failed: 0714-090151-000000Block task ended >> prematurely >> >> Cleaning up... >> Shutting down service at https://129.114.50.163:38571 >> >> i can file a bug report with TG if need be, but i'm not quite >> sure the best thing to tell them (?) also, i'm wondering how >> coasters was previously able to work around this bug? > >By redirecting stdout+stderr to memory, but that causes the "job manager >could not stage out a file" problem. actually, i meant (way back when this worked for me :) prior to any of the redirection (circa swift stable release 0.8ish)...but perhaps that's assuming the sge bug existed then as well... From hategan at mcs.anl.gov Tue Jul 14 17:16:38 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 14 Jul 2009 17:16:38 -0500 Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <20090714165740.CAE89638@m4500-02.uchicago.edu> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> <1247583541.1437.1.camel@localhost> <20090714101140.CAE36286@m4500-02.uchicago.edu> <1247584946.1759.8.camel@localhost> <20090714132226.CAE63545@m4500-02.uchicago.edu> <1247599317.7032.0.camel@localhost> <20090714165740.CAE89638@m4500-02.uchicago.edu> Message-ID: <1247609798.10133.1.camel@localhost> On Tue, 2009-07-14 at 16:57 -0500, skenny at uchicago.edu wrote: > ---- Original message ---- > >Date: Tue, 14 Jul 2009 14:21:57 -0500 > >From: Mihael Hategan > >Subject: Re: [Swift-devel] Coasters and std's on ranger > >To: skenny at uchicago.edu > >Cc: swift-devel > > > >On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote: > >> darn... > >> > >> Execution failed: > >> Exception in RInvoke: > >> Arguments: [scripts/4reg_dummy.R, > >> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech] > >> Host: RANGER > >> Directory: > >> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj > >> stderr.txt: > >> > >> stdout.txt: > >> > >> ---- > >> > >> Caused by: > >> Block task failed: 0714-090151-000000Block task ended > >> prematurely > >> > >> Cleaning up... > >> Shutting down service at https://129.114.50.163:38571 > >> > >> i can file a bug report with TG if need be, but i'm not quite > >> sure the best thing to tell them (?) also, i'm wondering how > >> coasters was previously able to work around this bug? > > > >By redirecting stdout+stderr to memory, but that causes the > "job manager > >could not stage out a file" problem. > > actually, i meant (way back when this worked for me :) prior > to any of the redirection (circa swift stable release > 0.8ish)...but perhaps that's assuming the sge bug existed then > as well... Yes, it did. But due to the way the coasters worked at the time, the error was ignored. I can make it such that this is the case again. 
From skenny at uchicago.edu Tue Jul 14 17:22:51 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Tue, 14 Jul 2009 17:22:51 -0500 (CDT) Subject: [Swift-devel] Coasters and std's on ranger In-Reply-To: <1247609798.10133.1.camel@localhost> References: <20090713215411.CAD90960@m4500-02.uchicago.edu> <1247540751.30172.5.camel@localhost> <1247541538.30172.8.camel@localhost> <20090714020526.CAE04557@m4500-02.uchicago.edu> <1247583541.1437.1.camel@localhost> <20090714101140.CAE36286@m4500-02.uchicago.edu> <1247584946.1759.8.camel@localhost> <20090714132226.CAE63545@m4500-02.uchicago.edu> <1247599317.7032.0.camel@localhost> <20090714165740.CAE89638@m4500-02.uchicago.edu> <1247609798.10133.1.camel@localhost> Message-ID: <20090714172251.CAE92317@m4500-02.uchicago.edu> ---- Original message ---- >Date: Tue, 14 Jul 2009 17:16:38 -0500 >From: Mihael Hategan >Subject: Re: [Swift-devel] Coasters and std's on ranger >To: skenny at uchicago.edu >Cc: swift-devel > >On Tue, 2009-07-14 at 16:57 -0500, skenny at uchicago.edu wrote: >> ---- Original message ---- >> >Date: Tue, 14 Jul 2009 14:21:57 -0500 >> >From: Mihael Hategan >> >Subject: Re: [Swift-devel] Coasters and std's on ranger >> >To: skenny at uchicago.edu >> >Cc: swift-devel >> > >> >On Tue, 2009-07-14 at 13:22 -0500, skenny at uchicago.edu wrote: >> >> darn... >> >> >> >> Execution failed: >> >> Exception in RInvoke: >> >> Arguments: [scripts/4reg_dummy.R, >> >> matrices/4_reg/network1/gestspeech.cov, 29, 0.5, speech] >> >> Host: RANGER >> >> Directory: >> >> 4reg_speech-20090714-1309-b650zi68/jobs/3/RInvoke-351ksndj >> >> stderr.txt: >> >> >> >> stdout.txt: >> >> >> >> ---- >> >> >> >> Caused by: >> >> Block task failed: 0714-090151-000000Block task ended >> >> prematurely >> >> >> >> Cleaning up... >> >> Shutting down service at https://129.114.50.163:38571 >> >> >> >> i can file a bug report with TG if need be, but i'm not quite >> >> sure the best thing to tell them (?) also, i'm wondering how >> >> coasters was previously able to work around this bug? >> > >> >By redirecting stdout+stderr to memory, but that causes the >> "job manager >> >could not stage out a file" problem. >> >> actually, i meant (way back when this worked for me :) prior >> to any of the redirection (circa swift stable release >> 0.8ish)...but perhaps that's assuming the sge bug existed then >> as well... > >Yes, it did. But due to the way the coasters worked at the time, the >error was ignored. I can make it such that this is the case again. > sounds good to me. From benc at hawaga.org.uk Wed Jul 15 03:36:23 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 15 Jul 2009 08:36:23 +0000 (GMT) Subject: [Swift-devel] Re: [Swift-commit] r3008 - SwiftApps/SEE/trunk In-Reply-To: <20090715020426.227839CCC4@vm-125-59.ci.uchicago.edu> References: <20090715020426.227839CCC4@vm-125-59.ci.uchicago.edu> Message-ID: You should file a bug describing this. 
On Tue, 14 Jul 2009, noreply at vm-125-59.ci.uchicago.edu wrote: > Author: aespinosa > Date: 2009-07-14 21:04:25 -0500 (Tue, 14 Jul 2009) > New Revision: 3008 > > Modified: > SwiftApps/SEE/trunk/instance_mapper.sh > Log: > swift code now compiles with struct hack from ben > > Modified: SwiftApps/SEE/trunk/instance_mapper.sh > =================================================================== > --- SwiftApps/SEE/trunk/instance_mapper.sh 2009-07-14 22:06:08 UTC (rev 3007) > +++ SwiftApps/SEE/trunk/instance_mapper.sh 2009-07-15 02:04:25 UTC (rev 3008) > @@ -15,6 +15,7 @@ > > echo "ofile result/$instance/stdout"; > > +echo "out null"; > echo "out.expend_out result/$instance/expend.out"; > echo "out.price_out result/$instance/price.out"; > echo "out.ratio_out result/$instance/ratio.out"; > > _______________________________________________ > Swift-commit mailing list > Swift-commit at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-commit > > From bugzilla-daemon at mcs.anl.gov Wed Jul 15 11:45:33 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 15 Jul 2009 11:45:33 -0500 (CDT) Subject: [Swift-devel] [Bug 217] New: struct of structs via ext mapper Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=217 Summary: struct of structs via ext mapper Product: Swift Version: unspecified Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: aespinosa at cs.uchicago.edu Swift is looking for a file describing the 2nd level struct itself. my swift session (latest on cog svn and swift svn) reports as follows: RunID: testing Progress: Progress: Initializing site shared directory:1 Failed:1 Execution failed: Mapper failed to map org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090714-1343-6lzjg014:720000000039 type AmplFilter with no value at dataset=res path=.out (not closed) my instance_mapper.sh: #!/bin/bash while getopts ":i:" options; do case $options in i) export instance=$OPTARG ;; *) exit 1;; esac done echo "expend result/$instance/expend.dat"; echo "limits result/$instance/limits.dat"; echo "price result/$instance/price.dat"; echo "ratio result/$instance/ratio.dat"; echo "solve result/$instance/solve.dat"; echo "ofile result/$instance/stdout"; echo "out.expend_out result/$instance/expend.out"; echo "out.price_out result/$instance/price.out"; echo "out.ratio_out result/$instance/ratio.out"; here is the workflow i was working on: type Template; type AmplIn; type StdOut; type AmplCmd { Template temp; AmplIn mod; AmplIn process; AmplIn output; AmplIn so; AmplIn tree; } type ExpendDat; type LimitsDat; type PriceDat; type RatioDat; type SolveDat; type ExpendOut; type PriceOut; type RatioOut; type AmplFilter { ExpendOut expend_out; PriceOut price_out; RatioOut ratio_out; } type AmplResult { ExpendDat expend; LimitsDat limits; PriceDat price; RatioDat ratio; SolveDat solve; StdOut ofile; AmplFilter out; } app (AmplResult result) run_ampl (string instanceID, AmplCmd cmd) { run_ampl instanceID @filename(cmd.temp) @filename(cmd.mod) @filename(cmd.process) @filename(cmd.output) @filename(cmd.so) @filename(cmd.tree) stdout=@filename(result.ofile); } AmplCmd const_cmd ; int runs[]=[2001:2002]; foreach i in runs { string instanceID = @strcat("run", i); AmplResult res ; res = run_ampl(instanceID, const_cmd); } Initial hack to get the script to work: > Author: aespinosa > Date: 2009-07-14 21:04:25 -0500 (Tue, 
14 Jul 2009) > New Revision: 3008 > > Modified: > SwiftApps/SEE/trunk/instance_mapper.sh > Log: > swift code now compiles with struct hack from ben > > Modified: SwiftApps/SEE/trunk/instance_mapper.sh > =================================================================== > --- SwiftApps/SEE/trunk/instance_mapper.sh 2009-07-14 22:06:08 UTC (rev 3007) > +++ SwiftApps/SEE/trunk/instance_mapper.sh 2009-07-15 02:04:25 UTC (rev 3008) > @@ -15,6 +15,7 @@ > > echo "ofile result/$instance/stdout"; > > +echo "out null"; > echo "out.expend_out result/$instance/expend.out"; > echo "out.price_out result/$instance/price.out"; > echo "out.ratio_out result/$instance/ratio.out"; -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. From iraicu at cs.uchicago.edu Wed Jul 15 14:04:49 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 15 Jul 2009 14:04:49 -0500 Subject: [Swift-devel] CFP: 2nd ACM Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS09) at Supercomputing 2009 Message-ID: <4A5E2851.1000500@cs.uchicago.edu> Call for Papers --------------------------------------------------------------------------------------- The 2nd ACM Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2009 http://dsl.cs.uchicago.edu/MTAGS09/ --------------------------------------------------------------------------------------- November 16th, 2009 Portland, Oregon, USA Co-located with with IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC09) ======================================================================================= The 2nd workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of loosely coupled large scale applications on large scale clusters, Grids, Supercomputers, and Cloud Computing infrastructure. Many-task computing (MTC), the theme of the workshop encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. This workshop will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of the raw hardware, parallel file system contention and scalability, reliability at scale, and application scalability. We welcome paper submissions on all topics related to MTC on large scale systems. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library. The workshop will be co-located with the IEEE/ACM Supercomputing 2009 Conference in Portland Oregon on November 16th, 2009. For more information, please visithttp://dsl.cs.uchicago.edu/MTAGS09/. Scope --------------------------------------------------------------------------------------- This workshop will focus on the ability to manage and execute large scale applications on today's largest clusters, Grids, and Supercomputers. Clusters with 50K+ processor cores are beginning to come online (i.e. TACC Sun Constellation System - Ranger), Grids (i.e. TeraGrid) with a dozen sites and 100K+ processors, and supercomputers with 160K processors (i.e. IBM BlueGene/P). 
Large clusters and supercomputers have traditionally been high performance computing (HPC) systems, as they are efficient at executing tightly coupled parallel jobs within a particular machine with low-latency interconnects; the applications typically use message passing interface (MPI) to achieve the needed inter-process communication. On the other hand, Grids have been the preferred platform for more loosely coupled applications that tend to be managed and executed through workflow systems. In contrast to HPC (tightly coupled applications), these loosely coupled applications make up a new class of applications as what we call Many-Task Computing (MTC). MTC systems generally involve the execution of independent, sequential jobs that can be individually scheduled on many different computing resources across multiple administrative boundaries. MTC systems typically achieve this using various grid computing technologies and techniques, and often times use files to achieve the inter-process communication as alternative communication mechanisms than MPI. MTC is reminiscent to High Throughput Computing (HTC); however, MTC differs from HTC in the emphasis of using many computing resources over short periods of time to accomplish many computational tasks, where the primary metrics are measured in seconds (e.g. FLOPS, tasks/sec, MB/s I/O rates). HTC on the other hand requires large amounts of computing for longer times (months and years, rather than hours and days, and are generally measured in operations per month). Today's existing HPC systems are a viable platform to host MTC applications. However, some challenges arise in large scale applications when run on large scale systems, which can hamper the efficiency and utilization of these large scale systems. These challenges vary from local resource manager scalability and granularity, efficient utilization of the raw hardware, shared file system contention and scalability, reliability at scale, application scalability, and understanding the limitations of the HPC systems in order to identify good candidate MTC applications. Furthermore, the MTC paradigm can be naturally applied to the emerging Cloud Computing paradigm due to its loosely coupled nature, which is being adopted by industry as the next wave of technological advancement to reduce operational costs while improving efficiencies in large scale infrastructures. For an interesting discussion in a blog by Ian Foster on the difference between MTC and HTC, please see his blog athttp://ianfoster.typepad.com/blog/2008/07/many-tasks-comp.html. We also published two papers that are highly relevant to this workshop. One paper is titled "Toward Loosely Coupled Programming on Petascale Systems", and was published in SC08; the second paper is titled "Many-Task Computing for Grids and Supercomputers", which was published in MTAGS08. Furthermore, to see last year's workshop program agenda, and accepted papers and presentations, please seehttp://dsl.cs.uchicago.edu/MTAGS08/. For more information, please visithttp://dsl.cs.uchicago.edu/MTAGS09/. 
Topics --------------------------------------------------------------------------------------- MTAGS 2008 topics of interest include, but are not limited to: * Compute Resource Management in large scale clusters, large Grids, Supercomputers, or Cloud Computing infrastructure o Scheduling o Job execution frameworks o Local resource manager extensions o Performance evaluation of resource managers in use on large scale systems o Challenges and opportunities in running many-task workloads on HPC systems o Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure * Data Management in large scale Grid and Supercomputer environments: o Data-Aware Scheduling o Parallel File System performance and scalability in large deployments o Distributed file systems o Data caching frameworks and techniques * Large-Scale Workflow Systems o Workflow system performance and scalability analysis o Scalability of workflow systems o Workflow infrastructure and e-Science middleware o Programming Paradigms and Models * Large-Scale Many-Task Applications o Large-scale many-task applications o Large-scale many-task data-intensive applications o Large-scale high throughput computing (HTC) applications o Quasi-supercomputing applications, deployments, and experiences Paper Submission and Publication --------------------------------------------------------------------------------------- Authors are invited to submit papers with unpublished, original work of not more than 10 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per ACM 8.5 x 11 manuscript guidelines (http://www.acm.org/publications/instructions_for_proceedings_volumes); document templates can be found athttp://www.acm.org/sigs/publications/proceedings-templates. A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/MTAGS2009/ before the deadline of August 1st, 2009 at 11:59PM PST; the final 10 page papers in PDF format will be due on September 1st, 2009 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library. Notifications of the paper decisions will be sent out by October 1st, 2009. Selected excellent work will be invited to submit extended versions of the workshop paper to the IEEE Transactions on Parallel and Distributed Systems (TPDS) Journal, Special Issue on Many-Task Computing (due December 21st, 2009); for more information about this journal special issue, please visithttp://dsl.cs.uchicago.edu/TPDS_MTC/. Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please visithttp://dsl.cs.uchicago.edu/MTAGS09/. 
Important Dates --------------------------------------------------------------------------------------- * Abstract Due: August 1st, 2009 * Papers Due: September 1st, 2009 * Notification of Acceptance: October 1st, 2009 * Camera Ready Papers Due: November 1st, 2009 * Workshop Date: November 16th, 2009 Committee Members --------------------------------------------------------------------------------------- Workshop Chairs * Ioan Raicu, University of Chicago * Ian Foster, University of Chicago& Argonne National Laboratory * Yong Zhao, Microsoft Technical Committee (confirmed) * David Abramson, Monash University, Australia * Pete Beckman, Argonne National Laboratory, USA * Peter Dinda, Northwestern University, USA * Ian Foster, University of Chicago& Argonne National Laboratory, USA * Bob Grossman, University of Illinois at Chicago, USA * Indranil Gupta, University of Illinois at Urbana Champaign, USA * Alexandru Iosup, Delft University of Technology, Netherlands * Kamil Iskra, Argonne National Laboratory, USA * Chuang Liu, Ask.com, USA * Zhou Lei, Shanghai University, China * Shiyong Lu, Wayne State University, USA * Reagan Moore, University of North Carolina at Chapel Hill, USA * Marlon Pierce, Indiana University, USA * Ioan Raicu, University of Chicago, USA * Matei Ripeanu, University of British Columbia, Canada * David Swanson, University of Nebraska, USA * Greg Thain, Univeristy of Wisconsin, USA * Matthew Woitaszek, The University Corporation for Atmospheric Research, USA * Mike Wilde, University of Chicago& Argonne National Laboratory, USA * Sherali Zeadally, University of the District of Columbia, USA * Yong Zhao, Microsoft, USA -------------- next part -------------- An HTML attachment was scrubbed... URL: From tanu00 at gmail.com Fri Jul 17 11:01:11 2009 From: tanu00 at gmail.com (Tanu Malik) Date: Fri, 17 Jul 2009 12:01:11 -0400 Subject: [Swift-devel] Provenance DB for Swift Message-ID: <66d19ae50907170901n533e1f4dl686f22b1c747cf7e@mail.gmail.com> Hi Ben, Mike I was wondering if there is open access to the Provenance DB for Swift ? We have built a provenance query and database that performs distributed provenance querying. Our examples are currently all artificial and I was wondering if we can test the same with Provenance DB for Swift. We have a deadline in Sept. and an early reply from you will be very helpful. Thanks, Tanu From hategan at mcs.anl.gov Fri Jul 17 15:35:21 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 17 Jul 2009 15:35:21 -0500 Subject: [Swift-devel] Coaster CPU-time consumption issue In-Reply-To: <1247524902.25358.3.camel@localhost> References: <4A5B64C7.4080802@mcs.anl.gov> <1247504642.17460.6.camel@localhost> <4A5B6ED6.60508@mcs.anl.gov> <1247509395.20144.4.camel@localhost> <50b07b4b0907131155p1718622ck1f02f2a04257d5e9@mail.gmail.com> <1247511969.21171.4.camel@localhost> <50b07b4b0907131212w3a9e37c6p464870741c83bade@mail.gmail.com> <4A5BA007.2050101@mcs.anl.gov> <50b07b4b0907131504l50a1bb6byab61991fc2132e00@mail.gmail.com> <1247524902.25358.3.camel@localhost> Message-ID: <1247862921.9627.1.camel@localhost> On that same topic, cog r2438 removes another spin that would get triggered in certain circumstances (after a bunch of jobs are done). On Mon, 2009-07-13 at 17:41 -0500, Mihael Hategan wrote: > A slightly modified version of this is in cog r2429. 
> > Thanks again, > > Mihael > > On Mon, 2009-07-13 at 17:04 -0500, Allan Espinosa wrote: > > hi, > > > > here is a patch which solves the cpu usage on the bootstrap coaster > > service: http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch > > > > suggested svn log entry: > > Added locks via wait() and notify() to prevent busy waiting/ > > active polling in the block task queue. > > > > > > Test 2000 touch job using 066-many.swift via local:local : > > before: http://www.ci.uchicago.edu/~aespinosa/swift/run06 > > after: http://www.ci.uchicago.edu/~aespinosa/swift/run07 > > > > CPU usage drops from 100% to 0% with a few 25-40 % spikes! > > > > -Allan > > > > > > 2009/7/13 Michael Wilde : > > > Hi Allan, > > > > > > I think the methods you want for synchronization are part of class Object. > > > > > > They are documented in the chapter Threads and Locks of The Java Language > > > Specification: > > > > > > http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8 > > > > > > queue.wait() should be called if the queue is empty. > > > > > > queue.notify() or .notifyall() should be called when something is added to > > > the queue. I think notify() should work. > > > > > > .wait will I think take a timer, but suspect you dont need that. > > > > > > Both should be called within the synchronized(queue) constructs that are > > > already in the code. > > > > > > Should be fun to fix this! > > > > > > - Mike > > > > > > > > > > > > > > > > > > On 7/13/09 2:12 PM, Allan Espinosa wrote: > > >> > > >> 97% is an average as can be seen in run06. swift version is r3005 and > > >> cogkit r2410. this is a vanilla build of swift. > > >> > > >> 2009/7/13 Mihael Hategan : > > >>> > > >>> A while ago I committed a patch to run the service process with a lower > > >>> priority. Is that in use? > > >>> > > >>> Also, is logging reduced or is it the default? > > >>> > > >>> Is the 97% CPU usage a spike, or does it stay there on average? > > >>> > > >>> Can I take a look at the coaster logs from skenny's run on ranger? > > >>> > > >>> I'd also like to point out in as little offensive mode as I can, that > > >>> I'm working 100% on I2U2 and my lack of getting more than lightly > > >>> involved in this is a consequence of that. > > >>> > > >>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote: > > >>>> > > >>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From > > >>>> here process 22395 is the child of the main java process > > >>>> (bootstrap.jar) and is loading the CPU. > > >>>> > > >>>> I have coasters.log, worker-*log, swift logs, gram logs in > > >>>> ~aespinosa/workflows/activelog/run06. This refers to a different run. > > >>>> PID 15206 is the child java process of bootstrap.jar in here. 
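A small sampling loop against the service JVM is one way to produce the kind of before/after CPU comparison quoted above; the PID (15206 in the run mentioned here) has to be read off ps for each run:

pid=15206                      # service JVM, read off ps for the run at hand
while kill -0 "$pid" 2>/dev/null; do
    echo "$(date +%T) $(ps -o pcpu=,rss= -p "$pid")" >> cpu-$pid.log
    sleep 1
done

Reading that log next to the swift and coaster logs makes it easy to tell a genuine busy-wait (a flat 97-100%) from an occasional burst of real work (the 25-40% spikes).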
> > >>>> > > >>>> top snapshot: > > >>>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, > > >>>> 0.55 > > >>>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie > > >>>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, > > >>>> 0.0%st > > >>>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers > > >>>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached > > >>>> > > >>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > > >>>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java > > >>>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top > > >>>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 > > >>>> globus-job-mana > > >>>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd > > >>>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash > > >>>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash > > >>>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash > > >>>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java > > >>>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 > > >>>> globus-job-man > > >>>> > > >>>> ps snapshot: > > >>>> > > >>>> 22328 ? S 0:00 \_ /bin/bash > > >>>> 22364 ? Sl 0:00 \_ > > >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java > > >>>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= > > >>>> > > >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up > > >>>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar > > >>>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520 > > >>>> https://128.135.125.17:46519 11505253269 > > >>>> 22395 ? SNl 6:29 \_ > > >>>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M > > >>>> > > >>>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up > > >>>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu > > >>>> -Djava.security.egd=file:///dev/urandom -cp > > >>>> > > >>>> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196f cd > ec9 > > > > > > 
46b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b860632568434270 1c > .jar > > > :/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendez vo > us_s > > > ervice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc > > >>>> > > >>>> > > >>>> > > >>>> 2009/7/13 Mihael Hategan : > > >>>>> > > >>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote: > > >>>>>>>> > > >>>>>>>> At the time we did not have a chance to gather detailed evidence, > > >>>>>>>> but I > > >>>>>>>> was surprised by two things: > > >>>>>>>> > > >>>>>>>> - that there were two Java processes and that one was so big. (Are > > >>>>>>>> most > > >>>>>>>> likely the active process was just a child thread of the main > > >>>>>>>> process?) 
> > >>>>>>> > > >>>>>>> One java process is the bootstrap process (it downloads the coaster > > >>>>>>> jars, sets up the environment and runs the coaster service). It has > > >>>>>>> always been like this. Did you happen to capture the output of ps to > > >>>>>>> a > > >>>>>>> file? That would be useful, because from what you are suggesting, it > > >>>>>>> appears that the bootstrap process is eating 100% CPU. That process > > >>>>>>> should only be sleeping after the service is started. > > >>>>>> > > >>>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I > > >>>>>> cant > > >>>>>> locate it. > > >>>>>> > > >>>>>> As best as I can recall it showed the larger memory-footprint process > > >>>>>> to > > >>>>>> be relatively idle, and the smaller footprint process (about 275MB) to > > >>>>>> be burning 100% of a CPU. > > >>>>> > > >>>>> Normally, the smaller footprint process should be the bootstrap. But > > >>>>> that's why I would like the ps output, because it sounds odd. > > >>>>> > > >>>>>> Allan will try to get a snapshot of this shortly. > > >>>>>> > > >>>>>> If this observation if correct, whats the best way to find out where > > >>>>>> its > > >>>>>> spinning? Profiling? Debug logging? Can you get profiling data from a > > >>>>>> JVM that doesnt exit? > > >>>>> > > >>>>> Once I know where it is, I can look at the code and then we'll go from > > >>>>> there. > > >>>>> > > >>>>> > > >>>>> > > >>>> > > >>>> > > >>> > > >>> > > >> > > >> > > >> > > > > > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Mon Jul 20 17:11:04 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 20 Jul 2009 17:11:04 -0500 Subject: [Swift-devel] coasters submit jobs with "count=0" in its globus RSL params Message-ID: <50b07b4b0907201511p758167f2v3fe24a5dca1da099@mail.gmail.com> session message: Caused by: Block task failed: Error submitting block task org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146) at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:66) Caused by: org.globus.gram.GramException: The provided RSL 'count' value is invalid (not an integer or must be greater than 0) at org.globus.gram.Gram.request(Gram.java:358) at org.globus.gram.GramJob.request(GramJob.java:262) at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134) ... 4 more Cleaning up... 
Shutting down service at https://129.114.50.163:45035 snippet of coasters.log: 2009-07-20 17:02:02,344-0500 INFO BlockQueueProcessor Settings { slots = 2 workersPerNode = 16 nodeGranularity = 1 allocationStepSize = 0.1 maxNodes = 2 lowOverallocation = 10.0 highOverallocation = 1.0 overallocationDecayFactor = 0.0010 spread = 0.9 reserve = 10.000s maxtime = 86400 project = TG-CCR080022N queue = normal remoteMonitorEnabled = false } 2009-07-20 17:02:02,345-0500 INFO BlockQueueProcessor Required size: 230400 for 16 jobs 2009-07-20 17:02:02,345-0500 INFO BlockQueueProcessor h: 28800, jj: 14400, x-last: , r: 1 2009-07-20 17:02:02,345-0500 INFO BlockQueueProcessor h: 43200, w: 2, size: 230400, msz: 230400, w*h: 86400 2009-07-20 17:02:02,355-0500 INFO BlockQueueProcessor Added: 0 - 5 2009-07-20 17:02:02,355-0500 INFO Block Starting block: workers=2, walltime=43200.000s 2009-07-20 17:02:02,358-0500 INFO BlockTaskSubmitter Queuing block Block 0720-010553-000000 (2x43200.000s) for submission 2009-07-20 17:02:02,359-0500 INFO BlockQueueProcessor Added 6 jobs to new blocks 2009-07-20 17:02:02,359-0500 INFO BlockQueueProcessor Plan time: 55 2009-07-20 17:02:02,359-0500 INFO BlockTaskSubmitter Submitting block Block 0720-010553-000000 (2x43200.000s) 2009-07-20 17:02:02,379-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1248127320448) setting status to Submitting 2009-07-20 17:02:02,381-0500 INFO Block Block task status changed: Submitting ---end-- with w=2, count = 2 / 16 = 0 when a Block is instantiated. sites.xml: TG-CCR080022N 16 normal 10000 0.32 2 2 4:00:00 86400 /scratch/01035/tg802895/see_runs obviously i need to get the right mix of overAllocation parameters. but an invalid RSL entry should at least be caught. I'll try to understand better BlockQueueProcessor.allocateBlocks to have at least an intelligent guess on what these values should be. -- Allan M. Espinosa PhD student, Computer Science University of Chicago From wilde at mcs.anl.gov Mon Jul 20 17:18:31 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 20 Jul 2009 17:18:31 -0500 Subject: [Swift-devel] coasters submit jobs with "count=0" in its globus RSL params In-Reply-To: <50b07b4b0907201511p758167f2v3fe24a5dca1da099@mail.gmail.com> References: <50b07b4b0907201511p758167f2v3fe24a5dca1da099@mail.gmail.com> Message-ID: <4A64ED37.7010108@mcs.anl.gov> Sarah, is this the same error you have been getting? (Invalid RSL count field?) 
- Mike On 7/20/09 5:11 PM, Allan Espinosa wrote: > session message: > Caused by: > Block task failed: Error submitting block task > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job > at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146) > at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100) > at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:66) > Caused by: org.globus.gram.GramException: The provided RSL 'count' > value is invalid (not an integer or must be greater than 0) > at org.globus.gram.Gram.request(Gram.java:358) > at org.globus.gram.GramJob.request(GramJob.java:262) > at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134) > ... 4 more > > Cleaning up... > Shutting down service at https://129.114.50.163:45035 > > snippet of coasters.log: > 2009-07-20 17:02:02,344-0500 INFO BlockQueueProcessor > Settings { > slots = 2 > workersPerNode = 16 > nodeGranularity = 1 > allocationStepSize = 0.1 > maxNodes = 2 > lowOverallocation = 10.0 > highOverallocation = 1.0 > overallocationDecayFactor = 0.0010 > spread = 0.9 > reserve = 10.000s > maxtime = 86400 > project = TG-CCR080022N > queue = normal > remoteMonitorEnabled = false > } > > 2009-07-20 17:02:02,345-0500 INFO BlockQueueProcessor Required size: > 230400 for 16 jobs > 2009-07-20 17:02:02,345-0500 INFO BlockQueueProcessor h: 28800, jj: > 14400, x-last: , r: 1 > 2009-07-20 17:02:02,345-0500 INFO BlockQueueProcessor h: 43200, w: 2, > size: 230400, msz: 230400, w*h: 86400 > 2009-07-20 17:02:02,355-0500 INFO BlockQueueProcessor Added: 0 - 5 > 2009-07-20 17:02:02,355-0500 INFO Block Starting block: workers=2, > walltime=43200.000s > 2009-07-20 17:02:02,358-0500 INFO BlockTaskSubmitter Queuing block > Block 0720-010553-000000 (2x43200.000s) for submission > 2009-07-20 17:02:02,359-0500 INFO BlockQueueProcessor Added 6 jobs to > new blocks > 2009-07-20 17:02:02,359-0500 INFO BlockQueueProcessor Plan time: 55 > 2009-07-20 17:02:02,359-0500 INFO BlockTaskSubmitter Submitting block > Block 0720-010553-000000 (2x43200.000s) > 2009-07-20 17:02:02,379-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:cog-1248127320448) setting status to Submitting > 2009-07-20 17:02:02,381-0500 INFO Block Block task status changed: Submitting > ---end-- > > with w=2, count = 2 / 16 = 0 when a Block is instantiated. > > sites.xml: > > > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > TG-CCR080022N > 16 > normal > 10000 > 0.32 > 2 > 2 > 4:00:00 > 86400 > > url="gt2://gatekeeper.ranger.tacc.teragrid.org" /> > /scratch/01035/tg802895/see_runs > > > > obviously i need to get the right mix of overAllocation parameters. > but an invalid RSL entry should at least be caught. > > I'll try to understand better BlockQueueProcessor.allocateBlocks to > have at least an intelligent guess on what these values should be. 
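The arithmetic behind the failure above: the block is created with w=2 workers while workersPerNode=16, and 2/16 truncates to 0 in integer division, which GRAM then rejects as an invalid count. A small sketch of the kind of check Allan asks for, catching (or rounding up) a zero count before the RSL ever reaches GRAM; the names are hypothetical, not the actual BlockQueueProcessor code:

// Hypothetical helper: derive the RSL "count" from the block size,
// assuming count = workers / workersPerNode as the log above suggests.
public class CountSanity {
    static int nodeCount(int workers, int workersPerNode) {
        if (workers <= 0 || workersPerNode <= 0) {
            throw new IllegalArgumentException(
                    "workers and workersPerNode must be positive");
        }
        // Ceiling division: 2 workers on 16-way nodes still needs one node.
        int count = (workers + workersPerNode - 1) / workersPerNode;
        if (count < 1) {
            // Fail fast locally instead of letting GRAM report
            // "The provided RSL 'count' value is invalid".
            throw new IllegalStateException("RSL count must be >= 1, got " + count);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(nodeCount(2, 16));  // 1 rather than 0
        System.out.println(nodeCount(80, 16)); // 5
    }
}

Whether the right behavior is to round up to one node or to refuse to build the block is a separate tuning question (the overAllocation settings Allan mentions); the point is only that a non-positive count can be detected before submission.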
> > From smartin at mcs.anl.gov Tue Jul 21 10:58:15 2009 From: smartin at mcs.anl.gov (Stuart Martin) Date: Tue, 21 Jul 2009 10:58:15 -0500 Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2 References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov> Message-ID: Are there any swift apps that can use queen bee? There is a GRAM5 service setup there for testing. -Stu Begin forwarded message: > From: Stuart Martin > Date: July 21, 2009 10:56:04 AM CDT > To: gateways at teragrid.org > Cc: Stuart Martin , Lukasz Lacinski > > Subject: Fwd: [gram-user] GRAM5 Alpha2 > > Hi Gateways, > > Any gateways that use (or can use) Queen Bee, it would be great if > you could target this new GRAM5 service that Lukasz deployed. I > heard from Lukasz that Jim has submitted a gateway user (SAML) job > and that went through fine and populate the gram audit DB > correctly. Thanks Jim! It would be nice to have some gateway push > the service to test scalability. > > Let us know if you plan to do this. > > Thanks, > Stu > > Begin forwarded message: > >> From: Lukasz Lacinski >> Date: July 21, 2009 1:18:05 AM CDT >> To: gram-user at lists.globus.org >> Subject: [gram-user] GRAM5 Alpha2 >> >> I've installed GRAM5 Alpha2 on Queen Bee. >> >> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork >> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs >> >> -seg-module pbs works fine. >> GRAM audit with PostgreSQL works fine. >> >> Can someone submit jobs as a gateway user? I'd like to check if the >> gateway_user field is written to our audit database. >> >> Thanks, >> Lukasz > From tiberius at ci.uchicago.edu Tue Jul 21 11:05:28 2009 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Tue, 21 Jul 2009 11:05:28 -0500 Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2 In-Reply-To: References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov> Message-ID: Hi Stu I was just installing yesterday my application on queenbee. So I could do some testing for you, just let me know how to take advantage of the new GRAM5 Does cogkit/swift already support GRAM5 ? Tibi On Tue, Jul 21, 2009 at 10:58 AM, Stuart Martin wrote: > Are there any swift apps that can use queen bee? ?There is a GRAM5 service > setup there for testing. > > -Stu > > Begin forwarded message: > >> From: Stuart Martin >> Date: July 21, 2009 10:56:04 AM CDT >> To: gateways at teragrid.org >> Cc: Stuart Martin , Lukasz Lacinski >> >> Subject: Fwd: [gram-user] GRAM5 Alpha2 >> >> Hi Gateways, >> >> Any gateways that use (or can use) Queen Bee, it would be great if you >> could target this new GRAM5 service that Lukasz deployed. ?I heard from >> Lukasz that Jim has submitted a gateway user (SAML) job and that went >> through fine and populate the gram audit DB correctly. ?Thanks Jim! ?It >> would be nice to have some gateway push the service to test scalability. >> >> Let us know if you plan to do this. >> >> Thanks, >> Stu >> >> Begin forwarded message: >> >>> From: Lukasz Lacinski >>> Date: July 21, 2009 1:18:05 AM CDT >>> To: gram-user at lists.globus.org >>> Subject: [gram-user] GRAM5 Alpha2 >>> >>> I've installed GRAM5 Alpha2 on Queen Bee. >>> >>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork >>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs >>> >>> -seg-module pbs works fine. >>> GRAM audit with PostgreSQL works fine. >>> >>> Can someone submit jobs as a gateway user? I'd like to check if the >>> gateway_user field is written to our audit database. 
>>> >>> Thanks, >>> Lukasz >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Computational Sciences Researcher Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From hategan at mcs.anl.gov Tue Jul 21 11:20:34 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Jul 2009 11:20:34 -0500 Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2 In-Reply-To: References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov> Message-ID: <1248193234.11850.25.camel@localhost> On Tue, 2009-07-21 at 11:05 -0500, Tiberiu Stef-Praun wrote: > Hi Stu > > I was just installing yesterday my application on queenbee. So I could > do some testing for you, just let me know how to take advantage of the > new GRAM5 > Does cogkit/swift already support GRAM5 ? Should work with it out-of-the-box. But then testing is for verifying that. From smartin at mcs.anl.gov Tue Jul 21 11:29:37 2009 From: smartin at mcs.anl.gov (Stuart Martin) Date: Tue, 21 Jul 2009 11:29:37 -0500 Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2 In-Reply-To: <1248193234.11850.25.camel@localhost> References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov> <1248193234.11850.25.camel@localhost> Message-ID: Wonderful. Let us know how it goes. -Stu On Jul 21, 2009, at Jul 21, 11:20 AM, Mihael Hategan wrote: > On Tue, 2009-07-21 at 11:05 -0500, Tiberiu Stef-Praun wrote: >> Hi Stu >> >> I was just installing yesterday my application on queenbee. So I >> could >> do some testing for you, just let me know how to take advantage of >> the >> new GRAM5 >> Does cogkit/swift already support GRAM5 ? > > Should work with it out-of-the-box. But then testing is for verifying > that. > > From tiberius at ci.uchicago.edu Tue Jul 21 11:30:45 2009 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Tue, 21 Jul 2009 11:30:45 -0500 Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2 In-Reply-To: References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov> <1248193234.11850.25.camel@localhost> Message-ID: So how do I test ? Some instructions would help ... On Tue, Jul 21, 2009 at 11:29 AM, Stuart Martin wrote: > Wonderful. ?Let us know how it goes. > > -Stu > > On Jul 21, 2009, at Jul 21, 11:20 AM, Mihael Hategan wrote: > >> On Tue, 2009-07-21 at 11:05 -0500, Tiberiu Stef-Praun wrote: >>> >>> Hi Stu >>> >>> I was just installing yesterday my application on queenbee. So I could >>> do some testing for you, just let me know how to take advantage of the >>> new GRAM5 >>> Does cogkit/swift already support GRAM5 ? >> >> Should work with it out-of-the-box. But then testing is for verifying >> that. >> >> > > -- Tiberiu (Tibi) Stef-Praun, PhD Computational Sciences Researcher Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From hategan at mcs.anl.gov Tue Jul 21 11:39:10 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Jul 2009 11:39:10 -0500 Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2 In-Reply-To: References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov> <1248193234.11850.25.camel@localhost> Message-ID: <1248194350.12672.0.camel@localhost> 1. Install your app on queenbee 2. Find the jobmanager contact for gram5 and put that in your sites.xml, together with the gridftp contact 3. 
Run swift On Tue, 2009-07-21 at 11:30 -0500, Tiberiu Stef-Praun wrote: > So how do I test ? > Some instructions would help ... > > > On Tue, Jul 21, 2009 at 11:29 AM, Stuart Martin wrote: > > Wonderful. Let us know how it goes. > > > > -Stu > > > > On Jul 21, 2009, at Jul 21, 11:20 AM, Mihael Hategan wrote: > > > >> On Tue, 2009-07-21 at 11:05 -0500, Tiberiu Stef-Praun wrote: > >>> > >>> Hi Stu > >>> > >>> I was just installing yesterday my application on queenbee. So I could > >>> do some testing for you, just let me know how to take advantage of the > >>> new GRAM5 > >>> Does cogkit/swift already support GRAM5 ? > >> > >> Should work with it out-of-the-box. But then testing is for verifying > >> that. > >> > >> > > > > > > > From aespinosa at cs.uchicago.edu Tue Jul 21 11:49:21 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 21 Jul 2009 11:49:21 -0500 Subject: [Swift-devel] more on # of coasters workers vs actual requested on ranger Message-ID: <50b07b4b0907210949w58fc6a59l34b937ff52e6c44b@mail.gmail.com> According to the gram logs, swift sends requests for blocks of 1, 2, 3 and 4 nodes but SGE receives requests for four 1 node jobs. This maybe a GRAM2-SGE interaction problem. Is there a way to get the globus RSL files from swift so I can submit manually and verify this? -Allan coasters.log: ... ... 2009-07-21 10:46:13,788-0500 INFO BlockQueueProcessor Required size: 28800 for 2 jobs 2009-07-21 10:46:13,788-0500 INFO BlockQueueProcessor h: 28800, jj: 14400, x-last: , r: 1 2009-07-21 10:46:13,788-0500 INFO BlockQueueProcessor h: 43200, w: 16, size: 28800, msz: 28800, w*h: 691200 2009-07-21 10:46:13,797-0500 INFO BlockQueueProcessor Added: 0 - 1 2009-07-21 10:46:13,797-0500 INFO Block Starting block: workers=16, walltime=43200.000s 2009-07-21 10:46:13,859-0500 INFO BlockTaskSubmitter Queuing block Block 0721-461009-000000 (16x43200.000s) for submission 2009-07-21 10:46:13,859-0500 INFO BlockQueueProcessor Added 2 jobs to new blocks 2009-07-21 10:46:13,860-0500 INFO BlockQueueProcessor Plan time: 287 2009-07-21 10:46:13,863-0500 INFO BlockTaskSubmitter Submitting block Block 0721-461009-000000 (16x43200.000s) 2009-07-21 10:46:13,887-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1248191171562) setting status to Submitting 2009-07-21 10:46:13,889-0500 INFO Block Block task status changed: Submitting 2009-07-21 10:46:15,339-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1248191171562) setting status to Submitted 2009-07-21 10:46:15,339-0500 INFO Block Block task status changed: Submitted ... ... ... 
2009-07-21 10:46:31,545-0500 INFO BlockQueueProcessor Required size: 1152000 for 80 jobs 2009-07-21 10:46:31,545-0500 INFO BlockQueueProcessor h: 28800, jj: 14400, x-last: , r: 31 2009-07-21 10:46:31,545-0500 INFO BlockQueueProcessor h: 43200, w: 48, size: 1152000, msz: 1152000, w*h: 2073600 2009-07-21 10:46:31,545-0500 INFO BlockQueueProcessor Added: 0 - 79 2009-07-21 10:46:31,545-0500 INFO Block Starting block: workers=48, walltime=43200.000s 2009-07-21 10:46:31,546-0500 INFO BlockTaskSubmitter Queuing block Block 0721-461009-000001 (48x43200.000s) for submission 2009-07-21 10:46:31,546-0500 INFO BlockQueueProcessor Added 80 jobs to new blocks 2009-07-21 10:46:31,546-0500 INFO BlockQueueProcessor Plan time: 3 2009-07-21 10:46:31,546-0500 INFO BlockTaskSubmitter Submitting block Block 0721-461009-000001 (48x43200.000s) 2009-07-21 10:46:31,546-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1248191171941) setting status to Submitting 2009-07-21 10:46:31,547-0500 INFO Block Block task status changed: Submitting ... ... 2009-07-21 10:46:33,755-0500 INFO BlockQueueProcessor Requeued 133 non-fitting jobs 2009-07-21 10:46:33,755-0500 INFO BlockQueueProcessor Required size: 1915200 for 133 jobs 2009-07-21 10:46:33,755-0500 INFO BlockQueueProcessor h: 28800, jj: 14400, x-last: , r: 4 2009-07-21 10:46:33,755-0500 INFO BlockQueueProcessor h: 43200, w: 64, size: 1915200, msz: 1915200, w*h: 2764800 2009-07-21 10:46:33,756-0500 INFO BlockQueueProcessor Added: 0 - 132 2009-07-21 10:46:33,756-0500 INFO Block Starting block: workers=64, walltime=43200.000s 2009-07-21 10:46:33,756-0500 INFO BlockTaskSubmitter Queuing block Block 0721-461009-000002 (64x43200.000s) for submission 2009-07-21 10:46:33,757-0500 INFO BlockQueueProcessor Added 133 jobs to new blocks 2009-07-21 10:46:33,757-0500 INFO BlockQueueProcessor Plan time: 4 ... ... 2009-07-21 10:46:35,980-0500 INFO BlockQueueProcessor Required size: 705600 for 49 jobs 2009-07-21 10:46:35,980-0500 INFO BlockQueueProcessor h: 28800, jj: 14400, x-last: , r: 16 2009-07-21 10:46:35,980-0500 INFO BlockQueueProcessor h: 43200, w: 32, size: 705600, msz: 705600, w*h: 1382400 2009-07-21 10:46:35,980-0500 INFO BlockQueueProcessor Added: 0 - 48 2009-07-21 10:46:35,980-0500 INFO Block Starting block: workers=32, walltime=43200.000s 2009-07-21 10:46:35,981-0500 INFO BlockTaskSubmitter Queuing block Block 0721-461009-000003 (32x43200.000s) for submission 2009-07-21 10:46:35,981-0500 INFO BlockQueueProcessor Added 49 jobs to new blocks 2009-07-21 10:46:35,981-0500 INFO BlockQueueProcessor Plan time: 4 2009-07-21 10:46:35,981-0500 INFO BlockTaskSubmitter Submitting block Block 0721-461009-000003 (32x43200.000s) 2009-07-21 10:46:35,981-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1248191172858) setting status to Submitting 2009-07-21 10:46:35,982-0500 INFO Block Block task status changed: Submitting ... ... gram log snippets: log1: (16 cpus) ... 7/21 10:46:14 Pre-parsed RSL string: &( rsl_substitution = (GLOBUSRUN_GASS_URL "https://129.114.50.163:52077") )( queue = "normal" )( project = "TG-CCR080022N" )( stdout = $(GLOBUSRUN_GASS_URL) # "/dev/stdout-urn:cog-1248191171562" )( arguments = "/share/home/01035/tg802895/.globus/coasters/cscript26994.pl" "http://1 29.114.50.163:52072" "0721-461009-000000" "16" )( count = "1" )( executable = "/usr/bin/perl" )( stderr = $(GLOBUSRUN_GASS_URL) # "/dev/stderr-urn:cog-12481911 71562" )( maxwalltime = "720" ) 7/21 10:46:14 ... log2: (48 cpus) ... 
7/21 10:46:32 Pre-parsed RSL string: &( rsl_substitution = (GLOBUSRUN_GASS_URL "https://129.114.50.163:52077") )( queue = "normal" )( project = "TG-CCR080022N" )( stdout = $(GLOBUSRUN_GASS_URL) # "/dev/stdout-urn:cog-1248191171941" )( arguments = "/share/home/01035/tg802895/.globus/coasters/cscript26994.pl" "http://1 29.114.50.163:52072" "0721-461009-000001" "16" )( count = "3" )( executable = "/usr/bin/perl" )( stderr = $(GLOBUSRUN_GASS_URL) # "/dev/stderr-urn:cog-12481911 71941" )( maxwalltime = "720" ) 7/21 10:46:32 ... log3: (64 cpus) ... 7/21 10:46:34 Pre-parsed RSL string: &( rsl_substitution = (GLOBUSRUN_GASS_URL "https://129.114.50.163:52077") )( queue = "normal" )( project = "TG-CCR080022N" )( stdout = $(GLOBUSRUN_GASS_URL) # "/dev/stdout-urn:cog-1248191172533" )( arguments = "/share/home/01035/tg802895/.globus/coasters/cscript26994.pl" "http://1 29.114.50.163:52072" "0721-461009-000002" "16" )( count = "4" )( executable = "/usr/bin/perl" )( stderr = $(GLOBUSRUN_GASS_URL) # "/dev/stderr-urn:cog-12481911 72533" )( maxwalltime = "720" ) 7/21 10:46:34 ... log4: (32 cpus) ... 7/21 10:46:36 Pre-parsed RSL string: &( rsl_substitution = (GLOBUSRUN_GASS_URL "https://129.114.50.163:52077") )( queue = "normal" )( project = "TG-CCR080022N" )( stdout = $(GLOBUSRUN_GASS_URL) # "/dev/stdout-urn:cog-1248191172858" )( arguments = "/share/home/01035/tg802895/.globus/coasters/cscript26994.pl" "http://1 29.114.50.163:52072" "0721-461009-000003" "16" )( count = "2" )( executable = "/usr/bin/perl" )( stderr = $(GLOBUSRUN_GASS_URL) # "/dev/stderr-urn:cog-12481911 72858" )( maxwalltime = "720" ) 7/21 10:46:36 ... what was actually requested: login4$ showq -u ACTIVE JOBS-------------------------- JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME ================================================================================ 0 active jobs : 0 of 3828 hosts ( 0.00 %) WAITING JOBS------------------------ JOBID JOBNAME USERNAME STATE CORE WCLIMIT QUEUETIME ================================================================================ 873041 data tg802895 Waiting 16 12:00:00 Tue Jul 21 10:46:17 873043 data tg802895 Waiting 16 12:00:00 Tue Jul 21 10:46:33 873044 data tg802895 Waiting 16 12:00:00 Tue Jul 21 10:46:36 873045 data tg802895 Waiting 16 12:00:00 Tue Jul 21 10:46:38 WAITING JOBS WITH JOB DEPENDENCIES--- JOBID JOBNAME USERNAME STATE CORE WCLIMIT QUEUETIME ================================================================================ UNSCHEDULED JOBS--------------------- JOBID JOBNAME USERNAME STATE CORE WCLIMIT QUEUETIME ================================================================================ Total jobs: 4 Active Jobs: 0 Waiting Jobs: 4 Dep/Unsched Jobs: 0 login4$ -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Tue Jul 21 11:59:58 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Jul 2009 11:59:58 -0500 Subject: [Swift-devel] more on # of coasters workers vs actual requested on ranger In-Reply-To: <50b07b4b0907210949w58fc6a59l34b937ff52e6c44b@mail.gmail.com> References: <50b07b4b0907210949w58fc6a59l34b937ff52e6c44b@mail.gmail.com> Message-ID: <1248195598.12972.6.camel@localhost> On Tue, 2009-07-21 at 11:49 -0500, Allan Espinosa wrote: > According to the gram logs, swift sends requests for blocks of 1, 2, 3 > and 4 nodes but SGE receives requests for four 1 node jobs. This > maybe a GRAM2-SGE interaction problem. Is there a way to get the > globus RSL files from swift so I can submit manually and verify this? 
In cog/modules/coaster/resources/log4.properties add: log4j.logger.org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler=DEBUG Then re-compile. But I don't think you need to go that far. Write your own RSL. In particular I'd suggest trying with both jobType=multiple and without. From aespinosa at cs.uchicago.edu Tue Jul 21 13:13:12 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 21 Jul 2009 13:13:12 -0500 Subject: Re: [Swift-devel] more on # of coasters workers vs actual requested on ranger In-Reply-To: <1248195598.12972.6.camel@localhost> References: <50b07b4b0907210949w58fc6a59l34b937ff52e6c44b@mail.gmail.com> <1248195598.12972.6.camel@localhost> Message-ID: <50b07b4b0907211113l4e554679j3de4743a7030fe85@mail.gmail.com> Aha. On Ranger the count clause refers to the number of CPUs, hence when coasters requests count=4 it only needs 1 node. If we want to do workersPerNode=16 then we should manually specify host_count=4 instead of count=4, or just use workersPerNode=1. I'll do more RSL exploration and probably play with the coaster's generation of GRAM2 requests. -Allan 2009/7/21 Mihael Hategan : > On Tue, 2009-07-21 at 11:49 -0500, Allan Espinosa wrote: >> According to the gram logs, swift sends requests for blocks of 1, 2, 3 >> and 4 nodes but SGE receives requests for four 1 node jobs. This >> may be a GRAM2-SGE interaction problem. Is there a way to get the >> globus RSL files from swift so I can submit manually and verify this? > > In cog/modules/coaster/resources/log4.properties add: > log4j.logger.org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler=DEBUG > > Then re-compile. > > But I don't think you need to go that far. Write your own RSL. In > particular I'd suggest trying with both jobType=multiple and without. > > > > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Tue Jul 21 13:20:32 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Jul 2009 13:20:32 -0500 Subject: Re: [Swift-devel] more on # of coasters workers vs actual requested on ranger In-Reply-To: <50b07b4b0907211113l4e554679j3de4743a7030fe85@mail.gmail.com> References: <50b07b4b0907210949w58fc6a59l34b937ff52e6c44b@mail.gmail.com> <1248195598.12972.6.camel@localhost> <50b07b4b0907211113l4e554679j3de4743a7030fe85@mail.gmail.com> Message-ID: <1248200432.16593.6.camel@localhost> On Tue, 2009-07-21 at 13:13 -0500, Allan Espinosa wrote: > Aha. > > On Ranger the count clause refers to the number of CPUs, hence when > coasters requests count=4 it only needs 1 node. If we want > to do workersPerNode=16 then we should manually specify host_count=4 > instead of count=4, or just use workersPerNode=1. Ah, right. I remember this funny problem. Can you find out how well this is supported in general? The gram docs are a bit vague: (hostCount=value) Only applies to clusters of SMP computers, such as newer IBM SP systems. Defines the number of nodes ("pizza boxes") to distribute the "count" processes across. > > I'll do more RSL exploration and probably play with the coaster's > generation of GRAM2 requests. > > -Allan > > 2009/7/21 Mihael Hategan : > > On Tue, 2009-07-21 at 11:49 -0500, Allan Espinosa wrote: > >> According to the gram logs, swift sends requests for blocks of 1, 2, 3 > >> and 4 nodes but SGE receives requests for four 1 node jobs. This > >> may be a GRAM2-SGE interaction problem. Is there a way to get the > >> globus RSL files from swift so I can submit manually and verify this?
> > > > In cog/modules/coaster/resources/log4.properties add: > > log4j.logger.org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler=DEBUG > > > > Then re-compile. > > > > But I don't think you need to go that far. Write your own RSL. In > > particular I'd suggest trying with both jobType=multiple and without. > > > > > > > > > > > From wilde at mcs.anl.gov Tue Jul 21 17:23:20 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 21 Jul 2009 17:23:20 -0500 Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2 In-Reply-To: References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov> Message-ID: <4A663FD8.3050909@mcs.anl.gov> Yes, there are a few we can run on QueenBee. Can try to test next week. Allan, we can test SEE/AMPL, OOPS, and PTMap there. - Mike On 7/21/09 10:58 AM, Stuart Martin wrote: > Are there any swift apps that can use queen bee? There is a GRAM5 > service setup there for testing. > > -Stu > > Begin forwarded message: > >> From: Stuart Martin >> Date: July 21, 2009 10:56:04 AM CDT >> To: gateways at teragrid.org >> Cc: Stuart Martin , Lukasz Lacinski >> >> Subject: Fwd: [gram-user] GRAM5 Alpha2 >> >> Hi Gateways, >> >> Any gateways that use (or can use) Queen Bee, it would be great if you >> could target this new GRAM5 service that Lukasz deployed. I heard >> from Lukasz that Jim has submitted a gateway user (SAML) job and that >> went through fine and populate the gram audit DB correctly. Thanks >> Jim! It would be nice to have some gateway push the service to test >> scalability. >> >> Let us know if you plan to do this. >> >> Thanks, >> Stu >> >> Begin forwarded message: >> >>> From: Lukasz Lacinski >>> Date: July 21, 2009 1:18:05 AM CDT >>> To: gram-user at lists.globus.org >>> Subject: [gram-user] GRAM5 Alpha2 >>> >>> I've installed GRAM5 Alpha2 on Queen Bee. >>> >>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork >>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs >>> >>> -seg-module pbs works fine. >>> GRAM audit with PostgreSQL works fine. >>> >>> Can someone submit jobs as a gateway user? I'd like to check if the >>> gateway_user field is written to our audit database. >>> >>> Thanks, >>> Lukasz >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Thu Jul 23 11:40:15 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 23 Jul 2009 11:40:15 -0500 Subject: [Swift-devel] coaster workers not receiving enough jobs Message-ID: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com> I tried 0660-many.swift with 200 5min sleep jobs using local:local mode (since queue on ranger and teraport takes a while to finish). The session spawned 192 workers. Swift reports at most 36 active processes at a time (which it finished successfully). After that workers reach idle time exceptions. 
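As an aside on the Ranger thread above: Mihael's suggestion there was to hand-write an RSL and compare runs with and without jobType=multiple, and Allan's observation was that Ranger reads count as the number of CPUs, so the node request belongs in host_count. A hedged sketch of such a hand-built RSL, assembled from the pre-parsed strings in the gram logs above, with Allan's literal substitution of host_count for the count the coaster service generated; which combination actually yields four 16-core nodes is exactly what the manual test would establish:

&( queue = "normal" )
 ( project = "TG-CCR080022N" )
 ( executable = "/usr/bin/perl" )
 ( arguments = "/share/home/01035/tg802895/.globus/coasters/cscript26994.pl" "http://129.114.50.163:52072" "0721-461009-000002" "16" )
 ( maxwalltime = "720" )
 ( host_count = "4" )
 ( jobType = "multiple" )

Variants worth comparing are this request as written, the same request with count = "4" restored alongside host_count, and the same request with jobType dropped, checking showq after each to see whether SGE honors hostCount the way the GRAM docs describe.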
Logs and stuff are in ~aespinosa/workflows/coaster_debug/run1/ sites.xml: /home/aespinosa/workflows/coaster_debug/workdir 10000 1.98 1 00:05:00 3600 swift session: Swift svn swift-r3011 cog-r2439 RunID: locallog Progress: Progress: Selecting site:198 Initializing site shared directory:1 Stage in:1 Progress: Selecting site:1 Submitting:198 Submitted:1 Progress: Selecting site:1 Submitted:198 Active:1 Progress: Selecting site:1 Submitted:192 Active:7 Progress: Selecting site:1 Submitted:188 Active:11 Progress: Selecting site:1 Submitted:181 Active:18 Progress: Selecting site:1 Submitted:178 Active:21 Progress: Selecting site:1 Submitted:163 Active:36 Progress: Selecting site:1 Submitted:163 Active:36 Progress: Selecting site:1 Submitted:163 Active:36 Progress: Selecting site:1 Submitted:163 Active:36 Progress: Selecting site:1 Submitted:163 Active:36 Progress: Selecting site:1 Submitted:163 Active:36 Progress: Selecting site:1 Submitted:163 Active:36 Progress: Selecting site:1 Submitted:163 Active:36 Progress: Selecting site:1 Submitted:163 Active:36 Progress: Selecting site:1 Submitted:163 Active:36 Progress: Selecting site:1 Submitted:163 Active:35 Checking status:1 Progress: Submitted:156 Active:35 Checking status:1 Finished successfully:8 Progress: Submitted:149 Active:34 Checking status:1 Finished successfully:16 Progress: Submitted:144 Active:35 Checking status:1 Finished successfully:20 Progress: Submitted:134 Active:30 Finished successfully:36 Progress: Submitted:134 Active:30 Finished successfully:36 Progress: Submitted:134 Active:30 Finished successfully:36 Progress: Submitted:134 Active:30 Finished successfully:36 Progress: Submitted:134 Active:30 Finished successfully:36 Progress: Submitted:134 Active:30 Finished successfully:36 Progress: Submitted:134 Active:30 Finished successfully:36 Progress: Submitted:134 Active:30 Finished successfully:36 Progress: Submitted:133 Active:31 Finished successfully:36 Failed to transfer wrapper log from 066-many-locallog/info/0 on localhost Failed to transfer wrapper log from 066-many-locallog/info/l on localhost Failed to transfer wrapper log from 066-many-locallog/info/k on localhost Failed to transfer wrapper log from 066-many-locallog/info/n on localhost Failed to transfer wrapper log from 066-many-locallog/info/o on localhost Failed to transfer wrapper log from 066-many-locallog/info/q on localhost ailed to transfer wrapper log from 066-many-locallog/info/c on localhost Failed to transfer wrapper log from 066-many-locallog/info/m on localhost Failed to transfer wrapper log from 066-many-locallog/info/i on localhost Failed to transfer wrapper log from 066-many-locallog/info/p on localhost Failed to transfer wrapper log from 066-many-locallog/info/a on localhost Progress: Stage in:11 Submitting:34 Submitted:113 Active:6 Finished successfully:36 Progress: Submitted:157 Active:7 Finished successfully:36 Failed to transfer wrapper log from 066-many-locallog/info/t on localhost Failed to transfer wrapper log from 066-many-locallog/info/u on localhost Failed to transfer wrapper log from 066-many-locallog/info/v on localhost Failed to transfer wrapper log from 066-many-locallog/info/x on localhost Failed to transfer wrapper log from 066-many-locallog/info/r on localhost Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished 
successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 Progress: Submitted:163 Active:1 Finished successfully:36 ... ... (not yet finished) $grep JOB_SUBMISSION coasters.log | grep Active | grep workerid | cat -n | tail 65 2009-07-23 11:08:10,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1248364974288-1248364979260-1248364979261) setting status to Active workerid=000055 66 2009-07-23 11:08:10,090-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1248364974280-1248364979248-1248364979249) setting status to Active workerid=000051 $ grep -a SUBMITJOB worker-0723-021156-00000* | grep Cmd | cat -n | tail 61 worker-0723-021156-000001.log:1248365290 000054 < len=9, actuallen=9, tag=1, flags=0, SUBMITJOB 62 worker-0723-021156-000001.log:1248365290 000050 < len=9, actuallen=9, tag=1, flags=0, SUBMITJOB 63 worker-0723-021156-000001.log:1248365290 000053 < len=9, actuallen=9, tag=1, flags=0, SUBMITJOB 64 worker-0723-021156-000001.log:1248365290 000052 < len=9, actuallen=9, tag=1, flags=0, SUBMITJOB 65 worker-0723-021156-000001.log:1248365290 000051 < len=9, actuallen=9, tag=1, flags=0, SUBMITJOB 66 worker-0723-021156-000001.log:1248365290 000055 < len=9, actuallen=9, tag=1, flags=0, SUBMITJOB it corresponds correctly with the swift session (more or less) since we had 30+ completed jobs. Some lines in coasters.log i find intersting: 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1248364974290-1248364979263-1248364979264) setting status to Submitted 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1248364974290-1248364979263-1248364979264) setting status to Active 2009-07-23 11:12:06,065-0500 INFO Command Sending Command(106, JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1) 2009-07-23 11:12:06,065-0500 INFO Command Command(106, JOBSTATUS) CMD: Command(106, JOBSTATUS) 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1248364974290-1248364979263-1248364979264) setting status to Failed Block ta sk failed: 0723-021156-000001Block task ended prematurely Statement unlikely to be reached at /home/aespinosa/.globus/coasters/cscript15423.pl line 580. (Maybe you meant system() when you said exec()?) 
2009-07-23 11:12:06,065-0500 INFO Command Sending Command(107, JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1) 2009-07-23 11:12:06,065-0500 INFO Command Command(107, JOBSTATUS) CMD: Command(107, JOBSTATUS) -Allan From hategan at mcs.anl.gov Thu Jul 23 11:49:33 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 23 Jul 2009 11:49:33 -0500 Subject: [Swift-devel] coaster workers not receiving enough jobs In-Reply-To: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com> References: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com> Message-ID: <1248367773.25313.5.camel@localhost> On Thu, 2009-07-23 at 11:40 -0500, Allan Espinosa wrote: > Some lines in coasters.log i find intersting: > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:1248364974290-1248364979263-1248364979264) setting status > to Failed Block ta > sk failed: 0723-021156-000001Block task ended prematurely > > Statement unlikely to be reached at > /home/aespinosa/.globus/coasters/cscript15423.pl line 580. > (Maybe you meant system() when you said exec()?) > I think perl is being extra-cautios there. The sequence of commands is the following: exec { $executable } @JOBARGS; print $WR "Could not execute $executable: $!\n"; die "Could not execute $executable: $!"; If exec succeeds, the print statement is indeed unreachable. However, it is there to deal with the case when exec doesn't succeed. There are ways to write it to avoid that warning, but that warning isn't indicative of an actual problem here. From aespinosa at cs.uchicago.edu Thu Jul 23 12:08:17 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 23 Jul 2009 12:08:17 -0500 Subject: [Swift-devel] coaster workers not receiving enough jobs In-Reply-To: <1248367773.25313.5.camel@localhost> References: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com> <1248367773.25313.5.camel@localhost> Message-ID: <50b07b4b0907231008qb309b6eu162eba00cf389663@mail.gmail.com> Ah right. Reading more into logs and code, my guess is that there is not enough Cpu.pull() calls to get jobs from the coaster service: $ grep pull coasters.log | grep -v Later | cat -n 62 2009-07-23 11:08:09,813-0500 INFO Cpu 0723-021156-000001:51 pull 63 2009-07-23 11:08:09,814-0500 INFO Cpu 0723-021156-000001:52 pull 64 2009-07-23 11:08:09,841-0500 INFO Cpu 0723-021156-000001:53 pull 65 2009-07-23 11:08:09,918-0500 INFO Cpu 0723-021156-000001:54 pull 66 2009-07-23 11:08:09,968-0500 INFO Cpu 0723-021156-000001:55 pull 67 2009-07-23 11:12:06,079-0500 INFO Cpu 0723-021156-000001:56 pull These pull() calls get invoked in the bunch of cpus in the pullthread correct? I'll read up on pullthreads and try to figure things out. -Allan 2009/7/23 Mihael Hategan : > On Thu, 2009-07-23 at 11:40 -0500, Allan Espinosa wrote: > > > There are ways to write it to avoid that warning, but that warning isn't > indicative of an actual problem here. > From hategan at mcs.anl.gov Thu Jul 23 12:35:44 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 23 Jul 2009 12:35:44 -0500 Subject: [Swift-devel] coaster workers not receiving enough jobs In-Reply-To: <50b07b4b0907231008qb309b6eu162eba00cf389663@mail.gmail.com> References: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com> <1248367773.25313.5.camel@localhost> <50b07b4b0907231008qb309b6eu162eba00cf389663@mail.gmail.com> Message-ID: <1248370544.27943.1.camel@localhost> On Thu, 2009-07-23 at 12:08 -0500, Allan Espinosa wrote: > Ah right. 
> > Reading more into logs and code, my guess is that there is not enough > Cpu.pull() calls to get jobs from the coaster service: > > $ grep pull coasters.log | grep -v Later | cat -n > 62 2009-07-23 11:08:09,813-0500 INFO Cpu 0723-021156-000001:51 pull > 63 2009-07-23 11:08:09,814-0500 INFO Cpu 0723-021156-000001:52 pull > 64 2009-07-23 11:08:09,841-0500 INFO Cpu 0723-021156-000001:53 pull > 65 2009-07-23 11:08:09,918-0500 INFO Cpu 0723-021156-000001:54 pull > 66 2009-07-23 11:08:09,968-0500 INFO Cpu 0723-021156-000001:55 pull > 67 2009-07-23 11:12:06,079-0500 INFO Cpu 0723-021156-000001:56 pull > > These pull() calls get invoked in the bunch of cpus in the pullthread > correct? I'll read up on pullthreads I don't think there's some official kind of "pullthread". It's a separate thread I wrote in order to allow waiting and avoid deadlocks. > and try to figure things out. > > -Allan > > 2009/7/23 Mihael Hategan : > > On Thu, 2009-07-23 at 11:40 -0500, Allan Espinosa wrote: > > > > > > There are ways to write it to avoid that warning, but that warning isn't > > indicative of an actual problem here. > > From andric at uchicago.edu Thu Jul 23 13:13:40 2009 From: andric at uchicago.edu (Michael Andric) Date: Thu, 23 Jul 2009 13:13:40 -0500 Subject: [Swift-devel] errors from HNL machines/swift Message-ID: HI Support, Swift dev, anyone else reading, I keep getting this crash on swift jobs submitted from HNL machines (both andrew.bsd.uchicago.edu and gwynn.bsd.uchicago.edu). These happen for different workflows, involving different processes. I am totally in the dark as to what this error is referring to as well as to what may be causing it. This crash has occurred on workflows that have just gone 'Active' as well as on workflows that were running for hours before crashing. Below is the error message. The log file is too big to attach but can be found here: /gpfs/pads/fmri/cnari/swift/projects/andric/peakfit_pilots/PK2/turnpointAnalysis/tpChiSqTests-20090723-1113-na2cuboc.log from one of the HNL machines (e.g., gwynn.bsd.uchicago.edu) Any insight is hugely appreciated - like i said, i don't even know what to debug b/c i don't know what the error is referring to. Michael Progress: Submitted:11 Active:1 Progress: Active:10 Stage out:2 # # An unexpected error has been detected by HotSpot Virtual Machine: # # SIGBUS (0x7) at pc=0xb75b9a62, pid=32310, tid=2949090208 # # Java VM: Java HotSpot(TM) Client VM (1.5.0_06-b05 mixed mode, sharing) # Problematic frame: # C [libzip.so+0xfa62] # # An error report file with more information is saved as hs_err_pid32310.log # # If you would like to submit a bug report, please visit: # http://java.sun.com/webapps/bugreport/crash.jsp # /gpfs/pads/fmri/apps/swift/bin/swift: line 100: 32310 Aborted java -Xmx2048M -Djava.endorsed.dirs=/gpfs/pads/fmri/apps/swift/bin/../lib/endorsed -DUID=1309 -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DGLOBUS_HOSTNAME= andrew.bsd.uchicago.edu -DCOG_INSTALL_PATH=/gpfs/pads/fmri/apps/swift/bin/.. -Dvds.home=/gpfs/pads/fmri/apps/swift/bin/.. -Dswift.home=/gpfs/pads/fmri/apps/swift/bin/.. 
-Djava.security.egd=file:///dev/urandom -Xmx1024m -classpath /gpfs/pads/fmri/apps/swift/bin/../etc:/gpfs/pads/fmri/apps/swift/bin/../libexec:/gpfs/pads/fmri/apps/swift/bin/../lib/addressing-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/ant.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/antlr-2.7.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/backport-util-concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/castor-0.9.6.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/coaster-bootstrap.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-abstraction-common-2.3.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-grapheditor-0.47.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-jglobus-dev-080222.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-karajan-0.36-dev.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-clref-gt4_0_0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-coaster-0.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-dcache-0.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt2-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-local-2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-localscheduler-0.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-ssh-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-webdav-2.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-resources-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-swift-svn.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-trap-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-util-0.92.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commonj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-beanutils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-collections-3.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-digester.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-discovery.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-httpclient.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-logging-1.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix32.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix-asn1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_mds_aggregator_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rft_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-client.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-utils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gvds.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-common-0.2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-core-0.2.2-patched.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-regexp-1.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-slide-webdavlib-2.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jaxrpc.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jce-jdk13-131.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jgss.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jsr173_1.0_api.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jug-lgpl-2.0.0.jar:/gpfs/pads/f
mri/apps/swift/bin/../lib/junit.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/log4j-1.2.8.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-common.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-factory.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-java.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-resources.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/opensaml.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/puretls.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/resolver.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/saaj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/stringtemplate.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/vdldefinitions.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsdl4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_index_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_usefulrp_schema_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_provider_jce.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_tools.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wss4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xalan.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean_xpath.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xercesImpl.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xml-apis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xmlsec.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xpp3-1.1.3.4d_b4_min.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xstream-1.1.1-patched.jar: org.griphyn.vdl.karajan.Loader 'tpChiSqTests.swift' '-sites.file' '/gpfs/pads/fmri/cnari_svn/config/coaster_ranger.xml' '-user=andric' -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Jul 23 13:20:16 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 23 Jul 2009 13:20:16 -0500 Subject: [Swift-devel] errors from HNL machines/swift In-Reply-To: References: Message-ID: <1248373216.28628.1.camel@localhost> There's a JVM dump, namely ?hs_err_pid32310.log. I'd like to see that. Otherwise it seems related to this: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6390352 You could try a newer JVM and see if the problem persists. On Thu, 2009-07-23 at 13:13 -0500, Michael Andric wrote: > HI Support, Swift dev, anyone else reading, > > I keep getting this crash on swift jobs submitted from HNL machines > (both andrew.bsd.uchicago.edu and gwynn.bsd.uchicago.edu). These > happen for different workflows, involving different processes. I am > totally in the dark as to what this error is referring to as well as > to what may be causing it. This crash has occurred on workflows that > have just gone 'Active' as well as on workflows that were running for > hours before crashing. > > > Below is the error message. The log file is too big to attach but can > be found here: > /gpfs/pads/fmri/cnari/swift/projects/andric/peakfit_pilots/PK2/turnpointAnalysis/tpChiSqTests-20090723-1113-na2cuboc.log > from one of the HNL machines (e.g., gwynn.bsd.uchicago.edu) > > > Any insight is hugely appreciated - like i said, i don't even know > what to debug b/c i don't know what the error is referring to. 
> Michael > > > > > > > > Progress: Submitted:11 Active:1 > Progress: Active:10 Stage out:2 > # > # An unexpected error has been detected by HotSpot Virtual Machine: > # > # SIGBUS (0x7) at pc=0xb75b9a62, pid=32310, tid=2949090208 > # > # Java VM: Java HotSpot(TM) Client VM (1.5.0_06-b05 mixed mode, > sharing) > # Problematic frame: > # C [libzip.so+0xfa62] > # > # An error report file with more information is saved as > hs_err_pid32310.log > # > # If you would like to submit a bug report, please visit: > # http://java.sun.com/webapps/bugreport/crash.jsp > # > /gpfs/pads/fmri/apps/swift/bin/swift: line 100: 32310 Aborted > java -Xmx2048M > -Djava.endorsed.dirs=/gpfs/pads/fmri/apps/swift/bin/../lib/endorsed > -DUID=1309 -DGLOBUS_TCP_PORT_RANGE=50000,51000 > -DGLOBUS_HOSTNAME=andrew.bsd.uchicago.edu -DCOG_INSTALL_PATH=/gpfs/pads/fmri/apps/swift/bin/.. -Dvds.home=/gpfs/pads/fmri/apps/swift/bin/.. -Dswift.home=/gpfs/pads/fmri/apps/swift/bin/.. -Djava.security.egd=file:///dev/urandom -Xmx1024m -classpath /gpfs/pads/fmri/apps/swift/bin/../etc:/gpfs/pads/fmri/apps/swift/bin/../libexec:/gpfs/pads/fmri/apps/swift/bin/../lib/addressing-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/ant.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/antlr-2.7.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/backport-util-concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/castor-0.9.6.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/coaster-bootstrap.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-abstraction-common-2.3.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-grapheditor-0.47.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-jglobus-dev-080222.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-karajan-0.36-dev.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-clref-gt4_0_0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-coaster-0.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-dcache-0.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt2-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-local-2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-localscheduler-0.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-ssh-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-webdav-2.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-resources-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-swift-svn.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-trap-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-util-0.92.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commonj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-beanutils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-collections-3.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-digester.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-discovery.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-httpclient.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-logging-1.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix32.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix-asn1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_mds_aggregator_stubs.ja
r:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rft_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-client.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-utils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gvds.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-common-0.2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-core-0.2.2-patched.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-regexp-1.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-slide-webdavlib-2.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jaxrpc.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jce-jdk13-131.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jgss.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jsr173_1.0_api.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jug-lgpl-2.0.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/junit.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/log4j-1.2.8.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-common.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-factory.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-java.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-resources.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/opensaml.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/puretls.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/resolver.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/saaj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/stringtemplate.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/vdldefinitions.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsdl4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_index_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_usefulrp_schema_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_provider_jce.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_tools.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wss4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xalan.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean_xpath.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xercesImpl.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xml-apis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xmlsec.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xpp3-1.1.3.4d_b4_min.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xstream-1.1.1-patched.jar: org.griphyn.vdl.karajan.Loader 'tpChiSqTests.swift' '-sites.file' '/gpfs/pads/fmri/cnari_svn/config/coaster_ranger.xml' '-user=andric' > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From andric at uchicago.edu Thu Jul 23 15:23:03 2009 From: andric at uchicago.edu (Michael Andric) Date: Thu, 23 Jul 2009 15:23:03 -0500 Subject: [Swift-devel] errors from HNL machines/swift In-Reply-To: <1248373216.28628.1.camel@localhost> References: <1248373216.28628.1.camel@localhost> Message-ID: there are a couple here: andrew.bsd.uchicago.edu:/tmp/hs*.log On Thu, Jul 23, 2009 at 1:20 PM, Mihael Hategan wrote: > There's a JVM dump, namely ?hs_err_pid32310.log. I'd like to see that. > > Otherwise it seems related to this: > http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6390352 > > You could try a newer JVM and see if the problem persists. 
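A concrete sketch of what "try a newer JVM" means on the softenv-managed CI machines (untested; the exact softenv key depends on what is installed on the host, and the key below is simply the one suggested further down in this thread):

    $ java -version         # JVM the swift wrapper currently picks up
    $ softenv | grep java   # list the JVM keys available on this host
    $ vi ~/.soft            # put e.g. +java-1.6.0_03-sun-r1 above any other lines
    $ resoft                # re-read ~/.soft for the current session
    $ java -version         # should now report the newer JVM
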
> > On Thu, 2009-07-23 at 13:13 -0500, Michael Andric wrote: > > HI Support, Swift dev, anyone else reading, > > > > I keep getting this crash on swift jobs submitted from HNL machines > > (both andrew.bsd.uchicago.edu and gwynn.bsd.uchicago.edu). These > > happen for different workflows, involving different processes. I am > > totally in the dark as to what this error is referring to as well as > > to what may be causing it. This crash has occurred on workflows that > > have just gone 'Active' as well as on workflows that were running for > > hours before crashing. > > > > > > Below is the error message. The log file is too big to attach but can > > be found here: > > > /gpfs/pads/fmri/cnari/swift/projects/andric/peakfit_pilots/PK2/turnpointAnalysis/tpChiSqTests-20090723-1113-na2cuboc.log > > from one of the HNL machines (e.g., gwynn.bsd.uchicago.edu) > > > > > > Any insight is hugely appreciated - like i said, i don't even know > > what to debug b/c i don't know what the error is referring to. > > Michael > > > > > > > > > > > > > > > > Progress: Submitted:11 Active:1 > > Progress: Active:10 Stage out:2 > > # > > # An unexpected error has been detected by HotSpot Virtual Machine: > > # > > # SIGBUS (0x7) at pc=0xb75b9a62, pid=32310, tid=2949090208 > > # > > # Java VM: Java HotSpot(TM) Client VM (1.5.0_06-b05 mixed mode, > > sharing) > > # Problematic frame: > > # C [libzip.so+0xfa62] > > # > > # An error report file with more information is saved as > > hs_err_pid32310.log > > # > > # If you would like to submit a bug report, please visit: > > # http://java.sun.com/webapps/bugreport/crash.jsp > > # > > /gpfs/pads/fmri/apps/swift/bin/swift: line 100: 32310 Aborted > > java -Xmx2048M > > -Djava.endorsed.dirs=/gpfs/pads/fmri/apps/swift/bin/../lib/endorsed > > -DUID=1309 -DGLOBUS_TCP_PORT_RANGE=50000,51000 > > -DGLOBUS_HOSTNAME=andrew.bsd.uchicago.edu-DCOG_INSTALL_PATH=/gpfs/pads/fmri/apps/swift/bin/.. > -Dvds.home=/gpfs/pads/fmri/apps/swift/bin/.. > -Dswift.home=/gpfs/pads/fmri/apps/swift/bin/.. 
> -Djava.security.egd=file:///dev/urandom -Xmx1024m -classpath > /gpfs/pads/fmri/apps/swift/bin/../etc:/gpfs/pads/fmri/apps/swift/bin/../libexec:/gpfs/pads/fmri/apps/swift/bin/../lib/addressing-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/ant.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/antlr-2.7.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/backport-util-concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/castor-0.9.6.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/coaster-bootstrap.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-abstraction-common-2.3.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-grapheditor-0.47.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-jglobus-dev-080222.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-karajan-0.36-dev.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-clref-gt4_0_0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-coaster-0.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-dcache-0.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt2-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-local-2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-localscheduler-0.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-ssh-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-webdav-2.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-resources-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-swift-svn.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-trap-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-util-0.92.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commonj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-beanutils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-collections-3.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-digester.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-discovery.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-httpclient.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-logging-1.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix32.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix-asn1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_mds_aggregator_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rft_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-client.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-utils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gvds.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-common-0.2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-core-0.2.2-patched.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-regexp-1.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-slide-webdavlib-2.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jaxrpc.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jce-jdk13-131.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jgss.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jsr173_1.0_api.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jug-lgpl-2.0.0.jar:/gpfs/pa
ds/fmri/apps/swift/bin/../lib/junit.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/log4j-1.2.8.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-common.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-factory.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-java.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-resources.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/opensaml.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/puretls.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/resolver.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/saaj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/stringtemplate.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/vdldefinitions.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsdl4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_index_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_usefulrp_schema_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_provider_jce.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_tools.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wss4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xalan.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean_xpath.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xercesImpl.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xml-apis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xmlsec.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xpp3-1.1.3.4d_b4_min.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xstream-1.1.1-patched.jar: > org.griphyn.vdl.karajan.Loader 'tpChiSqTests.swift' '-sites.file' > '/gpfs/pads/fmri/cnari_svn/config/coaster_ranger.xml' '-user=andric' > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Jul 23 15:33:23 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 23 Jul 2009 15:33:23 -0500 Subject: [Swift-devel] errors from HNL machines/swift In-Reply-To: References: <1248373216.28628.1.camel@localhost> Message-ID: <1248381203.32020.0.camel@localhost> Can't help you much there. It seems to be a bug in the JVM. Again, I'd try other versions of java. On Thu, 2009-07-23 at 15:23 -0500, Michael Andric wrote: > there are a couple here: andrew.bsd.uchicago.edu:/tmp/hs*.log > > On Thu, Jul 23, 2009 at 1:20 PM, Mihael Hategan > wrote: > There's a JVM dump, namely ?hs_err_pid32310.log. I'd like to > see that. > > Otherwise it seems related to this: > http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6390352 > > You could try a newer JVM and see if the problem persists. > > > On Thu, 2009-07-23 at 13:13 -0500, Michael Andric wrote: > > HI Support, Swift dev, anyone else reading, > > > > I keep getting this crash on swift jobs submitted from HNL > machines > > (both andrew.bsd.uchicago.edu and gwynn.bsd.uchicago.edu). > These > > happen for different workflows, involving different > processes. I am > > totally in the dark as to what this error is referring to as > well as > > to what may be causing it. This crash has occurred on > workflows that > > have just gone 'Active' as well as on workflows that were > running for > > hours before crashing. > > > > > > Below is the error message. 
The log file is too big to > attach but can > > be found here: > > /gpfs/pads/fmri/cnari/swift/projects/andric/peakfit_pilots/PK2/turnpointAnalysis/tpChiSqTests-20090723-1113-na2cuboc.log > > from one of the HNL machines (e.g., gwynn.bsd.uchicago.edu) > > > > > > Any insight is hugely appreciated - like i said, i don't > even know > > what to debug b/c i don't know what the error is referring > to. > > Michael > > > > > > > > > > > > > > > > Progress: Submitted:11 Active:1 > > Progress: Active:10 Stage out:2 > > # > > # An unexpected error has been detected by HotSpot Virtual > Machine: > > # > > # SIGBUS (0x7) at pc=0xb75b9a62, pid=32310, tid=2949090208 > > # > > # Java VM: Java HotSpot(TM) Client VM (1.5.0_06-b05 mixed > mode, > > sharing) > > # Problematic frame: > > # C [libzip.so+0xfa62] > > # > > # An error report file with more information is saved as > > hs_err_pid32310.log > > # > > # If you would like to submit a bug report, please visit: > > # http://java.sun.com/webapps/bugreport/crash.jsp > > # > > /gpfs/pads/fmri/apps/swift/bin/swift: line 100: 32310 > Aborted > > java -Xmx2048M > > > -Djava.endorsed.dirs=/gpfs/pads/fmri/apps/swift/bin/../lib/endorsed > > -DUID=1309 -DGLOBUS_TCP_PORT_RANGE=50000,51000 > > -DGLOBUS_HOSTNAME=andrew.bsd.uchicago.edu > -DCOG_INSTALL_PATH=/gpfs/pads/fmri/apps/swift/bin/.. > -Dvds.home=/gpfs/pads/fmri/apps/swift/bin/.. > -Dswift.home=/gpfs/pads/fmri/apps/swift/bin/.. > -Djava.security.egd=file:///dev/urandom -Xmx1024m > -classpath /gpfs/pads/fmri/apps/swift/bin/../etc:/gpfs/pads/fmri/apps/swift/bin/../libexec:/gpfs/pads/fmri/apps/swift/bin/../lib/addressing-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/ant.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/antlr-2.7.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/backport-util-concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/castor-0.9.6.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/coaster-bootstrap.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-abstraction-common-2.3.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-grapheditor-0.47.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-jglobus-dev-080222.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-karajan-0.36-dev.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-clref-gt4_0_0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-coaster-0.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-dcache-0.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt2-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-local-2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-localscheduler-0.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-ssh-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-webdav-2.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-resources-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-swift-svn.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-trap-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-util-0.92.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commonj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-beanutils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-collections-3.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-digester.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-discovery.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-httpclient.j
ar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons-logging-1.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix32.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix-asn1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_mds_aggregator_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rft_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-client.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram-utils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gvds.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-common-0.2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-core-0.2.2-patched.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-regexp-1.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-slide-webdavlib-2.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jaxrpc.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jce-jdk13-131.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jgss.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jsr173_1.0_api.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jug-lgpl-2.0.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/junit.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/log4j-1.2.8.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-common.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-factory.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-java.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming-resources.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/opensaml.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/puretls.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/resolver.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/saaj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/stringtemplate.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/vdldefinitions.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsdl4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_index_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_usefulrp_schema_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_provider_jce.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_tools.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wss4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xalan.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean_xpath.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xercesImpl.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xml-apis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xmlsec.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xpp3-1.1.3.4d_b4_min.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xstream-1.1.1-patched.jar: org.griphyn.vdl.karajan.Loader 'tpChiSqTests.swift' '-sites.file' '/gpfs/pads/fmri/cnari_svn/config/coaster_ranger.xml' '-user=andric' > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From support at ci.uchicago.edu Fri Jul 24 08:50:15 2009 From: support at ci.uchicago.edu (Ti Leggett) Date: Fri, 24 Jul 2009 08:50:15 -0500 Subject: [Swift-devel] [CI Ticketing System #1372] errors from HNL machines/swift In-Reply-To: <1248381203.32020.0.camel@localhost> References: 
<1248373216.28628.1.camel@localhost> <1248381203.32020.0.camel@localhost> Message-ID: Try adding +java-1.6.0_03-sun-r1 above any other lines in your ~/.soft and run resoft. See if that helps your issues. On Thu Jul 23 15:33:34 2009, hategan at mcs.anl.gov wrote: > Can't help you much there. It seems to be a bug in the JVM. Again, I'd > try other versions of java. > > On Thu, 2009-07-23 at 15:23 -0500, Michael Andric wrote: > > there are a couple here: andrew.bsd.uchicago.edu:/tmp/hs*.log > > > > On Thu, Jul 23, 2009 at 1:20 PM, Mihael Hategan > > > wrote: > > There's a JVM dump, namely ?hs_err_pid32310.log. I'd like to > > see that. > > > > Otherwise it seems related to this: > > http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6390352 > > > > You could try a newer JVM and see if the problem persists. > > > > > > On Thu, 2009-07-23 at 13:13 -0500, Michael Andric wrote: > > > HI Support, Swift dev, anyone else reading, > > > > > > I keep getting this crash on swift jobs submitted from HNL > > machines > > > (both andrew.bsd.uchicago.edu and gwynn.bsd.uchicago.edu). > > These > > > happen for different workflows, involving different > > processes. I am > > > totally in the dark as to what this error is referring to > as > > well as > > > to what may be causing it. This crash has occurred on > > workflows that > > > have just gone 'Active' as well as on workflows that were > > running for > > > hours before crashing. > > > > > > > > > Below is the error message. The log file is too big to > > attach but can > > > be found here: > > > > /gpfs/pads/fmri/cnari/swift/projects/andric/peakfit_pilots/PK2/turnpointAnalysis/tpChiSqTests- > 20090723-1113-na2cuboc.log > > > from one of the HNL machines (e.g., > gwynn.bsd.uchicago.edu) > > > > > > > > > Any insight is hugely appreciated - like i said, i don't > > even know > > > what to debug b/c i don't know what the error is referring > > to. > > > Michael > > > > > > > > > > > > > > > > > > > > > > > > Progress: Submitted:11 Active:1 > > > Progress: Active:10 Stage out:2 > > > # > > > # An unexpected error has been detected by HotSpot Virtual > > Machine: > > > # > > > # SIGBUS (0x7) at pc=0xb75b9a62, pid=32310, > tid=2949090208 > > > # > > > # Java VM: Java HotSpot(TM) Client VM (1.5.0_06-b05 mixed > > mode, > > > sharing) > > > # Problematic frame: > > > # C [libzip.so+0xfa62] > > > # > > > # An error report file with more information is saved as > > > hs_err_pid32310.log > > > # > > > # If you would like to submit a bug report, please visit: > > > # http://java.sun.com/webapps/bugreport/crash.jsp > > > # > > > /gpfs/pads/fmri/apps/swift/bin/swift: line 100: 32310 > > Aborted > > > java -Xmx2048M > > > > > > -Djava.endorsed.dirs=/gpfs/pads/fmri/apps/swift/bin/../lib/endorsed > > > -DUID=1309 -DGLOBUS_TCP_PORT_RANGE=50000,51000 > > > -DGLOBUS_HOSTNAME=andrew.bsd.uchicago.edu > > -DCOG_INSTALL_PATH=/gpfs/pads/fmri/apps/swift/bin/.. > > -Dvds.home=/gpfs/pads/fmri/apps/swift/bin/.. > > -Dswift.home=/gpfs/pads/fmri/apps/swift/bin/.. 
> > -Djava.security.egd=file:///dev/urandom -Xmx1024m > > -classpath > /gpfs/pads/fmri/apps/swift/bin/../etc:/gpfs/pads/fmri/apps/swift/bin/../libexec:/gpfs/pads/fmri/apps/swift/bin/../lib/addressing- > 1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/ant.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/antlr- > 2.7.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/axis- > url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/backport-util- > concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/castor- > 0.9.6.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/coaster- > bootstrap.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog- > abstraction-common- > 2.3.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog- > axis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-grapheditor- > 0.47.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-jglobus-dev- > 080222.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-karajan-0.36- > dev.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider-clref- > gt4_0_0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider- > coaster-0.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider- > dcache-0.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider- > gt2-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider- > gt4_0_0-2.5.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider- > local-2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-provider- > localscheduler-0.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog- > provider-ssh-2.4.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog- > provider-webdav-2.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog- > resources-1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-swift- > svn.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-trap- > 1.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog- > url.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cog-util- > 0.92.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commonj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons- > beanutils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons- > collections-3.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons- > digester.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons- > discovery.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons- > httpclient.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/commons- > logging- > 1.1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/concurrent.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix32.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix- > asn1.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/cryptix.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_delegation_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_mds_aggregator_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_service.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rendezvous_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/globus_wsrf_rft_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram- > client.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram- > stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gram- > utils.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/gvds.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh- > common-0.2.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/j2ssh-core- > 0.2.2-patched.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta- > regexp-1.2.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jakarta-slide- > webdavlib- > 2.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jaxrpc.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jce- > jdk13- > 
131.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jgss.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jsr173_1.0_api.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/jug- > lgpl- > 2.0.0.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/junit.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/log4j- > 1.2.8.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming- > common.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming- > factory.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming- > java.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/naming- > resources.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/opensaml.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/puretls.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/resolver.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/saaj.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/stringtemplate.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/vdldefinitions.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsdl4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_core_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_index_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_mds_usefulrp_schema_stubs.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_provider_jce.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wsrf_tools.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/wss4j.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xalan.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xbean_xpath.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xercesImpl.jar:/gpf s/pads/fmri/apps/swift/bin/../lib/xml- > apis.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xmlsec.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xpp3- > 1.1.3.4d_b4_min.jar:/gpfs/pads/fmri/apps/swift/bin/../lib/xstream- > 1.1.1-patched.jar: org.griphyn.vdl.karajan.Loader > 'tpChiSqTests.swift' '-sites.file' > '/gpfs/pads/fmri/cnari_svn/config/coaster_ranger.xml' '- > user=andric' > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From skenny at uchicago.edu Fri Jul 24 13:25:01 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Fri, 24 Jul 2009 13:25:01 -0500 (CDT) Subject: [Swift-devel] remote file mapping Message-ID: <20090724132501.CAO39274@m4500-02.uchicago.edu> does anyone know the syntax for mapping a file on a remote machine? i'm told it's possible but couldn't find it in the doc. thnx ~sk From hategan at mcs.anl.gov Fri Jul 24 13:47:58 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 24 Jul 2009 13:47:58 -0500 Subject: [Swift-devel] remote file mapping In-Reply-To: <20090724132501.CAO39274@m4500-02.uchicago.edu> References: <20090724132501.CAO39274@m4500-02.uchicago.edu> Message-ID: <1248461278.20398.0.camel@localhost> On Fri, 2009-07-24 at 13:25 -0500, skenny at uchicago.edu wrote: > does anyone know the syntax for mapping a file on a remote > machine? i'm told it's possible but couldn't find it in the doc. You should be able to use a URL in any of the mappers. Like "gsiftp://example.org/file". From skenny at uchicago.edu Fri Jul 24 15:14:11 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Fri, 24 Jul 2009 15:14:11 -0500 (CDT) Subject: [Swift-devel] remote file mapping In-Reply-To: <1248461278.20398.0.camel@localhost> References: <20090724132501.CAO39274@m4500-02.uchicago.edu> <1248461278.20398.0.camel@localhost> Message-ID: <20090724151411.CAO50741@m4500-02.uchicago.edu> thanks mihael! 
so, do you happen to know, would this mean that the gridftp server on the remote machine is configured to only accept requests from localhost? RunID: 20090724-1504-5khkoyd7 Progress: Execution failed: java.lang.RuntimeException: java.lang.RuntimeException: Could not instantiate file resource Caused by: Error communicating with the GridFTP server Caused by: Authentication failed [Caused by: Operation unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed. Expected "/CN=host/localhost.localdomain" target but received "/DC=edu/DC=uchicago/DC=ci/OU=hosts/CN=sidgrid.ci.uchicago.edu")] [skenny at sidgrid urltest]$ ---- Original message ---- >Date: Fri, 24 Jul 2009 13:47:58 -0500 >From: Mihael Hategan >Subject: Re: [Swift-devel] remote file mapping >To: skenny at uchicago.edu >Cc: swift-devel at ci.uchicago.edu > >On Fri, 2009-07-24 at 13:25 -0500, skenny at uchicago.edu wrote: >> does anyone know the syntax for mapping a file on a remote >> machine? i'm told it's possible but couldn't find it in the doc. > >You should be able to use a URL in any of the mappers. Like >"gsiftp://example.org/file". > > From hategan at mcs.anl.gov Fri Jul 24 16:10:55 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 24 Jul 2009 16:10:55 -0500 Subject: [Swift-devel] remote file mapping In-Reply-To: <20090724151411.CAO50741@m4500-02.uchicago.edu> References: <20090724132501.CAO39274@m4500-02.uchicago.edu> <1248461278.20398.0.camel@localhost> <20090724151411.CAO50741@m4500-02.uchicago.edu> Message-ID: <1248469855.23180.0.camel@localhost> On Fri, 2009-07-24 at 15:14 -0500, skenny at uchicago.edu wrote: > thanks mihael! > > so, do you happen to know, would this mean that the gridftp > server on the remote machine is configured to only accept > requests from localhost? Are you submitting from sidgrid.ci.uchicago.edu to sidgrid.ci.uchicago.edu? What does your script look like? > > RunID: 20090724-1504-5khkoyd7 > Progress: > Execution failed: > java.lang.RuntimeException: > java.lang.RuntimeException: Could not instantiate file resource > Caused by: > Error communicating with the GridFTP server > Caused by: > Authentication failed [Caused by: Operation > unauthorized (Mechanism level: [JGLOBUS-56] Authorization > failed. Expected "/CN=host/localhost.localdomain" target but > received > "/DC=edu/DC=uchicago/DC=ci/OU=hosts/CN=sidgrid.ci.uchicago.edu")] > [skenny at sidgrid urltest]$ > > > ---- Original message ---- > >Date: Fri, 24 Jul 2009 13:47:58 -0500 > >From: Mihael Hategan > >Subject: Re: [Swift-devel] remote file mapping > >To: skenny at uchicago.edu > >Cc: swift-devel at ci.uchicago.edu > > > >On Fri, 2009-07-24 at 13:25 -0500, skenny at uchicago.edu wrote: > >> does anyone know the syntax for mapping a file on a remote > >> machine? i'm told it's possible but couldn't find it in the > doc. > > > >You should be able to use a URL in any of the mappers. Like > >"gsiftp://example.org/file". > > > > From skenny at uchicago.edu Fri Jul 24 16:35:09 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Fri, 24 Jul 2009 16:35:09 -0500 (CDT) Subject: [Swift-devel] remote file mapping Message-ID: <20090724163509.CAO58555@m4500-02.uchicago.edu> i'm on sidgrid, trying to gftp a file from andrew.bsd.uchicago.edu to ranger. 
the script looks like this: file covMatrix; Rscript mxScript; int totalperms[] = [1:100]; float initweight = .5; foreach perm in totalperms{ mxModel modmin; modmin = mxModelProcessor(covMatrix, mxScript, perm, initweight, "speech"); but this is failing as well: [skenny at sidgrid urltest]$ globus-url-copy gsiftp://andrew.bsd.uchicago.edu/tmp/gestspeech.cov gsiftp://gridftp.ranger.tacc.teragrid.org:2811/guc.test GlobusUrlCopy error: UrlCopy third party transfer failed. [Caused by: Connection refused] ---- Original message ---- >Date: Fri, 24 Jul 2009 16:10:55 -0500 >From: Mihael Hategan >Subject: Re: [Swift-devel] remote file mapping >To: skenny at uchicago.edu >Cc: swift-devel at ci.uchicago.edu > >On Fri, 2009-07-24 at 15:14 -0500, skenny at uchicago.edu wrote: >> thanks mihael! >> >> so, do you happen to know, would this mean that the gridftp >> server on the remote machine is configured to only accept >> requests from localhost? > >Are you submitting from sidgrid.ci.uchicago.edu to >sidgrid.ci.uchicago.edu? > >What does your script look like? > >> >> RunID: 20090724-1504-5khkoyd7 >> Progress: >> Execution failed: >> java.lang.RuntimeException: >> java.lang.RuntimeException: Could not instantiate file resource >> Caused by: >> Error communicating with the GridFTP server >> Caused by: >> Authentication failed [Caused by: Operation >> unauthorized (Mechanism level: [JGLOBUS-56] Authorization >> failed. Expected "/CN=host/localhost.localdomain" target but >> received >> "/DC=edu/DC=uchicago/DC=ci/OU=hosts/CN=sidgrid.ci.uchicago.edu")] >> [skenny at sidgrid urltest]$ >> >> >> ---- Original message ---- >> >Date: Fri, 24 Jul 2009 13:47:58 -0500 >> >From: Mihael Hategan >> >Subject: Re: [Swift-devel] remote file mapping >> >To: skenny at uchicago.edu >> >Cc: swift-devel at ci.uchicago.edu >> > >> >On Fri, 2009-07-24 at 13:25 -0500, skenny at uchicago.edu wrote: >> >> does anyone know the syntax for mapping a file on a remote >> >> machine? i'm told it's possible but couldn't find it in the >> doc. >> > >> >You should be able to use a URL in any of the mappers. Like >> >"gsiftp://example.org/file". >> > >> > > From hategan at mcs.anl.gov Fri Jul 24 16:38:18 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 24 Jul 2009 16:38:18 -0500 Subject: [Swift-devel] remote file mapping In-Reply-To: <20090724163509.CAO58555@m4500-02.uchicago.edu> References: <20090724163509.CAO58555@m4500-02.uchicago.edu> Message-ID: <1248471498.23879.1.camel@localhost> Wait, wait. Slow down. One problem at a time. On Fri, 2009-07-24 at 16:35 -0500, skenny at uchicago.edu wrote: > i'm on sidgrid, trying to gftp a file from > andrew.bsd.uchicago.edu to ranger. > > the script looks like this: > > file > covMatrix; Why "gsiftp:///" instead of "gsiftp://"? > Rscript > mxScript; > > int totalperms[] = [1:100]; > float initweight = .5; > foreach perm in totalperms{ > mxModel modmin file=@strcat("gsiftp:///andrew.ci.uchicago.edu/home/skenny/swift_runs/urltest/results/speech_",perm,".rdata")>; > modmin = mxModelProcessor(covMatrix, mxScript, perm, > initweight, "speech"); > > but this is failing as well: > > [skenny at sidgrid urltest]$ globus-url-copy > gsiftp://andrew.bsd.uchicago.edu/tmp/gestspeech.cov > gsiftp://gridftp.ranger.tacc.teragrid.org:2811/guc.test > GlobusUrlCopy error: UrlCopy third party transfer failed. 
> [Caused by: Connection refused] > > > > ---- Original message ---- > >Date: Fri, 24 Jul 2009 16:10:55 -0500 > >From: Mihael Hategan > >Subject: Re: [Swift-devel] remote file mapping > >To: skenny at uchicago.edu > >Cc: swift-devel at ci.uchicago.edu > > > >On Fri, 2009-07-24 at 15:14 -0500, skenny at uchicago.edu wrote: > >> thanks mihael! > >> > >> so, do you happen to know, would this mean that the gridftp > >> server on the remote machine is configured to only accept > >> requests from localhost? > > > >Are you submitting from sidgrid.ci.uchicago.edu to > >sidgrid.ci.uchicago.edu? > > > >What does your script look like? > > > >> > >> RunID: 20090724-1504-5khkoyd7 > >> Progress: > >> Execution failed: > >> java.lang.RuntimeException: > >> java.lang.RuntimeException: Could not instantiate file resource > >> Caused by: > >> Error communicating with the GridFTP server > >> Caused by: > >> Authentication failed [Caused by: Operation > >> unauthorized (Mechanism level: [JGLOBUS-56] Authorization > >> failed. Expected "/CN=host/localhost.localdomain" target but > >> received > >> > "/DC=edu/DC=uchicago/DC=ci/OU=hosts/CN=sidgrid.ci.uchicago.edu")] > >> [skenny at sidgrid urltest]$ > >> > >> > >> ---- Original message ---- > >> >Date: Fri, 24 Jul 2009 13:47:58 -0500 > >> >From: Mihael Hategan > >> >Subject: Re: [Swift-devel] remote file mapping > >> >To: skenny at uchicago.edu > >> >Cc: swift-devel at ci.uchicago.edu > >> > > >> >On Fri, 2009-07-24 at 13:25 -0500, skenny at uchicago.edu wrote: > >> >> does anyone know the syntax for mapping a file on a remote > >> >> machine? i'm told it's possible but couldn't find it in the > >> doc. > >> > > >> >You should be able to use a URL in any of the mappers. Like > >> >"gsiftp://example.org/file". > >> > > >> > > > From wilde at mcs.anl.gov Sun Jul 26 18:09:59 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 26 Jul 2009 18:09:59 -0500 Subject: [Swift-devel] Re: [Swift-user] XDTM In-Reply-To: References: Message-ID: <4A6CE247.4010105@mcs.anl.gov> Jamal, As Swift evolved from its early prototypes to a more mature system, the notion of XDTM evolved to one of mapping between filesystem-based structures and Swift in-memory data structures (ie, scalars, arrays, and structures, which can be nested and typed). This is best seen by looking at the "external" mapper, which allows a user to map a dataset using any external program (typically a script) that returns the members of the dataset as a two-column list: the Swift variable reference, and the external file or URI. See the user guide section on the external mapper: http://www.ci.uchicago.edu/swift/guides/userguide.php#mapper.ext_mapper (but the example in the user guide doesn't show the power of mapping to nested structures). In other words, it still has the flavor of XDTM, but without any XML being visible to the user. It meets the same need but is easier to use and explain. - Mike On 7/26/09 2:50 PM, J A wrote: > Hi All: > > Can any one direct me to a source with more examples/explanation on > how XDTM is working/implemented? 
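A minimal sketch of the external mapper described above (names, paths, and URIs here are made up, and the snippet is untested): the mapper is just an executable that prints one line per dataset member, with the Swift path in the first column and the file or URI in the second.

    $ cat map_results.sh
    #!/bin/sh
    # one line per member: <path inside the Swift variable> <file or URI>
    echo '[0] gsiftp://example.org/results/speech_1.rdata'
    echo '[1] gsiftp://example.org/results/speech_2.rdata'

The SwiftScript side then maps an array with it:

    file results[] <ext; exec="./map_results.sh">;

For a nested type such as "type result { file rdata; file log; }" the first column would carry member paths like [0].rdata instead, which is the nested-structure case mentioned above.
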
> > Thanks, > Jamal > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Mon Jul 27 23:21:52 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 27 Jul 2009 23:21:52 -0500 Subject: [Swift-devel] Swift trunk seems to be broken Message-ID: <4A6E7CE0.10901@mcs.anl.gov> This script: com$ cat >t3.swift type d { int x; } com$ Gives: com$ swift t3.swift Swift svn swift-r3019 cog-r2445 RunID: 20090727-2313-2zka71if Execution failed: org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to number: unbounded Caused by: For input string: "unbounded" com$ com$ java -version java version "1.5.0_06" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05) Java HotSpot(TM) Server VM (build 1.5.0_06-b05, mixed mode) com$ Is anyone else seeing this problem? This fails for me on both communicado and on the BG/P. On the BG/P I tried with both Java 2.4 and Java 6; both failed the same way. - Mike From wilde at mcs.anl.gov Mon Jul 27 23:51:56 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 27 Jul 2009 23:51:56 -0500 Subject: [Swift-devel] Swift trunk seems to be broken In-Reply-To: <4A6E7CE0.10901@mcs.anl.gov> References: <4A6E7CE0.10901@mcs.anl.gov> Message-ID: <4A6E83EC.4070703@mcs.anl.gov> I think its cog rev 2440 thats causing the problem. 2440 fails: com$ swift t3.swift Swift svn swift-r3021 cog-r2440 RunID: 20090727-2333-dpf7v3ze Execution failed: org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to number: unbounded Caused by: For input string: "unbounded" com$ cd - /home/wilde/swift/src/cog/modules/swift com$ com$ 2339 works: com$ swift t3.swift Swift svn swift-r3021 cog-r2439 RunID: 20090727-2337-g19sgr5f com$ 2440 is: com$ svn diff -r 2439:2440 Index: modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java =================================================================== --- modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java (revision 2439) +++ modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java (revision 2440) @@ -84,7 +84,13 @@ public boolean sys_equals(VariableStack stack) throws ExecutionException { Object[] args = getArguments(ARGS_2VALUES, stack); - return args[0].equals(args[1]); + if (args[0] instanceof Number) { + Number n2 = TypeUtil.toNumber(args[1]); + return ((Number) args[0]).doubleValue() == n2.doubleValue(); + } + else { + return args[0].equals(args[1]); + } Exception in log (example) is below. 
- Mike 2009-07-27 21:36:28,559-0500 INFO unknown Swift svn swift-r3019 (swift modified locally) cog-r2445 2009-07-27 21:36:28,561-0500 INFO unknown RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090727-2136-ibeu4gif 2009-07-27 21:36:28,719-0500 DEBUG VDL2ExecutionContext org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to number: unbounded org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to number: unbounded Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to number: unbounded at org.globus.cog.karajan.util.TypeUtil.toNumber(TypeUtil.java:61) at org.globus.cog.karajan.workflow.nodes.functions.Misc.sys_equals(Misc.java:88) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:85) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:58) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:60) at java.lang.reflect.Method.invoke(Method.java:391) at org.globus.cog.karajan.workflow.nodes.functions.FunctionsCollection.function(FunctionsCollection.java:78) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled Code)) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled Code)) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:37) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled Code)) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled Code)) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled Code)) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined Compiled Code)) at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled Code)) Caused by: java.lang.NumberFormatException: For input string: "unbounded" at java.lang.NumberFormatException.forInputString(NumberFormatException.java(Compiled Code)) at java.lang.FloatingDecimal.readJavaFormatString(FloatingDecimal.java(Compiled Code)) at java.lang.Double.valueOf(Double.java:227) at org.globus.cog.karajan.util.TypeUtil.toNumber(TypeUtil.java:51) ... 
25 more On 7/27/09 11:21 PM, Michael Wilde wrote: > This script: > > com$ cat >t3.swift > type d { > int x; > } > com$ > > Gives: > > com$ swift t3.swift > Swift svn swift-r3019 cog-r2445 > > RunID: 20090727-2313-2zka71if > Execution failed: > org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not > convert value to number: unbounded > Caused by: > For input string: "unbounded" > com$ > > > com$ java -version > java version "1.5.0_06" > Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05) > Java HotSpot(TM) Server VM (build 1.5.0_06-b05, mixed mode) > com$ > > > Is anyone else seeing this problem? > > This fails for me on both communicado and on the BG/P. > On the BG/P I tried with both Java 2.4 and Java 6; both failed the same > way. > > - Mike > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Jul 28 00:10:59 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 28 Jul 2009 00:10:59 -0500 Subject: [Swift-devel] Swift trunk seems to be broken In-Reply-To: <4A6E83EC.4070703@mcs.anl.gov> References: <4A6E7CE0.10901@mcs.anl.gov> <4A6E83EC.4070703@mcs.anl.gov> Message-ID: <1248757859.24917.0.camel@localhost> Fixed in cog r2446. On Mon, 2009-07-27 at 23:51 -0500, Michael Wilde wrote: > I think its cog rev 2440 thats causing the problem. > > 2440 fails: > > com$ swift t3.swift > Swift svn swift-r3021 cog-r2440 > > RunID: 20090727-2333-dpf7v3ze > Execution failed: > org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not > convert value to number: unbounded > Caused by: > For input string: "unbounded" > com$ cd - > /home/wilde/swift/src/cog/modules/swift > com$ > com$ > > > 2339 works: > > com$ swift t3.swift > Swift svn swift-r3021 cog-r2439 > > RunID: 20090727-2337-g19sgr5f > com$ > > > 2440 is: > > com$ svn diff -r 2439:2440 > Index: > modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java > =================================================================== > --- > modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java > (revision 2439) > +++ > modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java > (revision 2440) > @@ -84,7 +84,13 @@ > > public boolean sys_equals(VariableStack stack) throws > ExecutionException { > Object[] args = getArguments(ARGS_2VALUES, stack); > - return args[0].equals(args[1]); > + if (args[0] instanceof Number) { > + Number n2 = TypeUtil.toNumber(args[1]); > + return ((Number) args[0]).doubleValue() == n2.doubleValue(); > + } > + else { > + return args[0].equals(args[1]); > + } > > Exception in log (example) is below. 
> > - Mike > > 2009-07-27 21:36:28,559-0500 INFO unknown Swift svn swift-r3019 (swift > modified locally) cog-r2445 > > 2009-07-27 21:36:28,561-0500 INFO unknown RUNID > id=tag:benc at ci.uchicago.edu,2007:swift:run:20090727-2136-ibeu4gif > 2009-07-27 21:36:28,719-0500 DEBUG VDL2ExecutionContext > org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not > convert value to number: unbounded > org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not > convert value to number: unbounded > Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: > Could not convert value to number: unbounded > at org.globus.cog.karajan.util.TypeUtil.toNumber(TypeUtil.java:61) > at > org.globus.cog.karajan.workflow.nodes.functions.Misc.sys_equals(Misc.java:88) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:85) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:58) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:60) > at java.lang.reflect.Method.invoke(Method.java:391) > at > org.globus.cog.karajan.workflow.nodes.functions.FunctionsCollection.function(FunctionsCollection.java:78) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:37) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined > Compiled Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined > Compiled Code)) > at > org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled > Code)) > Caused by: java.lang.NumberFormatException: For input string: "unbounded" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java(Compiled > Code)) > at > java.lang.FloatingDecimal.readJavaFormatString(FloatingDecimal.java(Compiled > Code)) > at java.lang.Double.valueOf(Double.java:227) > at org.globus.cog.karajan.util.TypeUtil.toNumber(TypeUtil.java:51) > ... 
25 more > > > > > > On 7/27/09 11:21 PM, Michael Wilde wrote: > > This script: > > > > com$ cat >t3.swift > > type d { > > int x; > > } > > com$ > > > > Gives: > > > > com$ swift t3.swift > > Swift svn swift-r3019 cog-r2445 > > > > RunID: 20090727-2313-2zka71if > > Execution failed: > > org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not > > convert value to number: unbounded > > Caused by: > > For input string: "unbounded" > > com$ > > > > > > com$ java -version > > java version "1.5.0_06" > > Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05) > > Java HotSpot(TM) Server VM (build 1.5.0_06-b05, mixed mode) > > com$ > > > > > > Is anyone else seeing this problem? > > > > This fails for me on both communicado and on the BG/P. > > On the BG/P I tried with both Java 2.4 and Java 6; both failed the same > > way. > > > > - Mike > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Jul 28 00:38:20 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 28 Jul 2009 00:38:20 -0500 Subject: [Swift-devel] Swift trunk seems to be broken In-Reply-To: <1248757859.24917.0.camel@localhost> References: <4A6E7CE0.10901@mcs.anl.gov> <4A6E83EC.4070703@mcs.anl.gov> <1248757859.24917.0.camel@localhost> Message-ID: <4A6E8ECC.1080102@mcs.anl.gov> That works - thanks! Glen, please try your latest OOPS run again now. - Mike On 7/28/09 12:10 AM, Mihael Hategan wrote: > Fixed in cog r2446. > > On Mon, 2009-07-27 at 23:51 -0500, Michael Wilde wrote: >> I think its cog rev 2440 thats causing the problem. >> >> 2440 fails: >> >> com$ swift t3.swift >> Swift svn swift-r3021 cog-r2440 >> >> RunID: 20090727-2333-dpf7v3ze >> Execution failed: >> org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not >> convert value to number: unbounded >> Caused by: >> For input string: "unbounded" >> com$ cd - >> /home/wilde/swift/src/cog/modules/swift >> com$ >> com$ >> >> >> 2339 works: >> >> com$ swift t3.swift >> Swift svn swift-r3021 cog-r2439 >> >> RunID: 20090727-2337-g19sgr5f >> com$ >> >> >> 2440 is: >> >> com$ svn diff -r 2439:2440 >> Index: >> modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java >> =================================================================== >> --- >> modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java >> (revision 2439) >> +++ >> modules/karajan/src/org/globus/cog/karajan/workflow/nodes/functions/Misc.java >> (revision 2440) >> @@ -84,7 +84,13 @@ >> >> public boolean sys_equals(VariableStack stack) throws >> ExecutionException { >> Object[] args = getArguments(ARGS_2VALUES, stack); >> - return args[0].equals(args[1]); >> + if (args[0] instanceof Number) { >> + Number n2 = TypeUtil.toNumber(args[1]); >> + return ((Number) args[0]).doubleValue() == n2.doubleValue(); >> + } >> + else { >> + return args[0].equals(args[1]); >> + } >> >> Exception in log (example) is below. 
>> >> - Mike >> >> 2009-07-27 21:36:28,559-0500 INFO unknown Swift svn swift-r3019 (swift >> modified locally) cog-r2445 >> >> 2009-07-27 21:36:28,561-0500 INFO unknown RUNID >> id=tag:benc at ci.uchicago.edu,2007:swift:run:20090727-2136-ibeu4gif >> 2009-07-27 21:36:28,719-0500 DEBUG VDL2ExecutionContext >> org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not >> convert value to number: unbounded >> org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not >> convert value to number: unbounded >> Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: >> Could not convert value to number: unbounded >> at org.globus.cog.karajan.util.TypeUtil.toNumber(TypeUtil.java:61) >> at >> org.globus.cog.karajan.workflow.nodes.functions.Misc.sys_equals(Misc.java:88) >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:85) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:58) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:60) >> at java.lang.reflect.Method.invoke(Method.java:391) >> at >> org.globus.cog.karajan.workflow.nodes.functions.FunctionsCollection.function(FunctionsCollection.java:78) >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) >> at >> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:37) >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined >> Compiled Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined >> Compiled Code)) >> at >> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled >> Code)) >> Caused by: java.lang.NumberFormatException: For input string: "unbounded" >> at >> java.lang.NumberFormatException.forInputString(NumberFormatException.java(Compiled >> Code)) >> at >> java.lang.FloatingDecimal.readJavaFormatString(FloatingDecimal.java(Compiled >> Code)) >> at java.lang.Double.valueOf(Double.java:227) >> at 
org.globus.cog.karajan.util.TypeUtil.toNumber(TypeUtil.java:51) >> ... 25 more >> >> >> >> >> >> On 7/27/09 11:21 PM, Michael Wilde wrote: >>> This script: >>> >>> com$ cat >t3.swift >>> type d { >>> int x; >>> } >>> com$ >>> >>> Gives: >>> >>> com$ swift t3.swift >>> Swift svn swift-r3019 cog-r2445 >>> >>> RunID: 20090727-2313-2zka71if >>> Execution failed: >>> org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not >>> convert value to number: unbounded >>> Caused by: >>> For input string: "unbounded" >>> com$ >>> >>> >>> com$ java -version >>> java version "1.5.0_06" >>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05) >>> Java HotSpot(TM) Server VM (build 1.5.0_06-b05, mixed mode) >>> com$ >>> >>> >>> Is anyone else seeing this problem? >>> >>> This fails for me on both communicado and on the BG/P. >>> On the BG/P I tried with both Java 2.4 and Java 6; both failed the same >>> way. >>> >>> - Mike >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From smartin at mcs.anl.gov Tue Jul 28 09:26:30 2009 From: smartin at mcs.anl.gov (Stuart Martin) Date: Tue, 28 Jul 2009 09:26:30 -0500 Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2 In-Reply-To: <4A663FD8.3050909@mcs.anl.gov> References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov> <4A663FD8.3050909@mcs.anl.gov> Message-ID: Hi Mike, Just following up on this. Will there be some swift use of GRAM5 on queen bee this week? -Stu On Jul 21, 2009, at Jul 21, 5:23 PM, Michael Wilde wrote: > Yes, there are a few we can run on QueenBee. > > Can try to test next week. > > Allan, we can test SEE/AMPL, OOPS, and PTMap there. > > - Mike > > > On 7/21/09 10:58 AM, Stuart Martin wrote: >> Are there any swift apps that can use queen bee? There is a GRAM5 >> service setup there for testing. >> -Stu >> Begin forwarded message: >>> From: Stuart Martin >>> Date: July 21, 2009 10:56:04 AM CDT >>> To: gateways at teragrid.org >>> Cc: Stuart Martin , Lukasz Lacinski >> > >>> Subject: Fwd: [gram-user] GRAM5 Alpha2 >>> >>> Hi Gateways, >>> >>> Any gateways that use (or can use) Queen Bee, it would be great if >>> you could target this new GRAM5 service that Lukasz deployed. I >>> heard from Lukasz that Jim has submitted a gateway user (SAML) job >>> and that went through fine and populate the gram audit DB >>> correctly. Thanks Jim! It would be nice to have some gateway >>> push the service to test scalability. >>> >>> Let us know if you plan to do this. >>> >>> Thanks, >>> Stu >>> >>> Begin forwarded message: >>> >>>> From: Lukasz Lacinski >>>> Date: July 21, 2009 1:18:05 AM CDT >>>> To: gram-user at lists.globus.org >>>> Subject: [gram-user] GRAM5 Alpha2 >>>> >>>> I've installed GRAM5 Alpha2 on Queen Bee. >>>> >>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork >>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs >>>> >>>> -seg-module pbs works fine. >>>> GRAM audit with PostgreSQL works fine. >>>> >>>> Can someone submit jobs as a gateway user? I'd like to check if >>>> the gateway_user field is written to our audit database. 
>>>> >>>> Thanks, >>>> Lukasz >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Jul 28 10:56:37 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 28 Jul 2009 10:56:37 -0500 Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2 In-Reply-To: References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov> <4A663FD8.3050909@mcs.anl.gov> Message-ID: <4A6F1FB5.2080207@mcs.anl.gov> Allan Espinosa will try to test AMPL workflows for the SEE project there this week. I may try a few others time permitting, but likely not this week. Questions, Stu: - do you want testing through Condor-G with the grid_monitor as well as native? - for native testing of GRAM5 (ie through the plain pre-WS GRAM interface) are then any guidelines for how many jobs we can safely submit at once, or should we not worry about limits? (ie sending a few thousand jobs is OK?) Allan: I just remembered that since Queenbee has 8-core hosts like Abe, coasters is the only reasonable approach for large-scale testing. But testing just a few AMPL jobs through plain GRAM5 seems a reasonable step to do first. I realize that coaster testing, also, wont give good CPU utilization until the current "low demand" problem is solved. - Mike On 7/28/09 9:26 AM, Stuart Martin wrote: > Hi Mike, > > Just following up on this. Will there be some swift use of GRAM5 on > queen bee this week? > > -Stu > > On Jul 21, 2009, at Jul 21, 5:23 PM, Michael Wilde wrote: > >> Yes, there are a few we can run on QueenBee. >> >> Can try to test next week. >> >> Allan, we can test SEE/AMPL, OOPS, and PTMap there. >> >> - Mike >> >> >> On 7/21/09 10:58 AM, Stuart Martin wrote: >>> Are there any swift apps that can use queen bee? There is a GRAM5 >>> service setup there for testing. >>> -Stu >>> Begin forwarded message: >>>> From: Stuart Martin >>>> Date: July 21, 2009 10:56:04 AM CDT >>>> To: gateways at teragrid.org >>>> Cc: Stuart Martin , Lukasz Lacinski >>>> >>>> Subject: Fwd: [gram-user] GRAM5 Alpha2 >>>> >>>> Hi Gateways, >>>> >>>> Any gateways that use (or can use) Queen Bee, it would be great if >>>> you could target this new GRAM5 service that Lukasz deployed. I >>>> heard from Lukasz that Jim has submitted a gateway user (SAML) job >>>> and that went through fine and populate the gram audit DB >>>> correctly. Thanks Jim! It would be nice to have some gateway push >>>> the service to test scalability. >>>> >>>> Let us know if you plan to do this. >>>> >>>> Thanks, >>>> Stu >>>> >>>> Begin forwarded message: >>>> >>>>> From: Lukasz Lacinski >>>>> Date: July 21, 2009 1:18:05 AM CDT >>>>> To: gram-user at lists.globus.org >>>>> Subject: [gram-user] GRAM5 Alpha2 >>>>> >>>>> I've installed GRAM5 Alpha2 on Queen Bee. >>>>> >>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork >>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs >>>>> >>>>> -seg-module pbs works fine. >>>>> GRAM audit with PostgreSQL works fine. >>>>> >>>>> Can someone submit jobs as a gateway user? I'd like to check if the >>>>> gateway_user field is written to our audit database. 
>>>>> >>>>> Thanks, >>>>> Lukasz >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From foster at anl.gov Tue Jul 28 11:09:01 2009 From: foster at anl.gov (Ian Foster) Date: Tue, 28 Jul 2009 11:09:01 -0500 Subject: [Swift-devel] Functionality request: best effort execution In-Reply-To: References: Message-ID: I agree that this sort of thing would be of great value for some applications. Note that this would make provenance recording more interesting and important! (As you need to record what happened, not just the input arguments.) Ian. On Jul 14, 2009, at 2:09 AM, Ben Clifford wrote: > > One way of putting in ambiguity here is something like the AMB(iguous) > operator, which looks very similar to Karajan's race behaviour. > > a AMB b evaluates to either a or b but its not defined which and > so the > runtime can pick which. > > That has no particular preference for a result, though in Tibi's use > case > one of the results is probably preferred. > > You could change the semantics so that it returns a unless a fails in > which case it evaluates and returns b, unless b fails in which case > the > expression fails to evaluate. > > Both of the above descriptions can be extended to more than two > operands > in a natural way. > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From smartin at mcs.anl.gov Tue Jul 28 11:17:23 2009 From: smartin at mcs.anl.gov (Stuart Martin) Date: Tue, 28 Jul 2009 11:17:23 -0500 Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2 In-Reply-To: <4A6F1FB5.2080207@mcs.anl.gov> References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov> <4A663FD8.3050909@mcs.anl.gov> <4A6F1FB5.2080207@mcs.anl.gov> Message-ID: <6701C4A1-FC0D-4DA4-9972-2F9CEC8EF11D@mcs.anl.gov> On Jul 28, 2009, at Jul 28, 10:56 AM, Michael Wilde wrote: > Allan Espinosa will try to test AMPL workflows for the SEE project > there this week. > > I may try a few others time permitting, but likely not this week. > > Questions, Stu: > - do you want testing through Condor-G with the grid_monitor as well > as native? I'd say to use GRAM5 as is best for you/your users. We've done some condor-g testing with and without the grid-monitor. We did with, just for backward compatibility. But without is recommended. The grid- monitor is no longer needed with GRAM5. So, if you have users that use condor-g, then submit GRAM5 jobs with that. But, turn off using the grid-monitor. http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_7:_gram5- condor-g But if it is "better" to submit them natively, through cog API I assume(?), then do that. > - for native testing of GRAM5 (ie through the plain pre-WS GRAM > interface) are then any guidelines for how many jobs we can safely > submit at once, or should we not worry about limits? (ie sending a > few thousand jobs is OK?) Don't worry about it and submit away. We need to know the limits/ breaking points. But, to show what we've done in our testing, here are the results from our 5 client tests (each running in a separate VM) hitting the same GRAM5 service. http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_4:_5-client- seg_2 http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_5:_5-client- seg-diffusers_2 They submitted 5000 jobs over a 1 hour window to the same GRAM5 service. 
The load on the head node never went above 4 on the first and 7 on the second. > > Allan: I just remembered that since Queenbee has 8-core hosts like > Abe, coasters is the only reasonable approach for large-scale > testing. But testing just a few AMPL jobs through plain GRAM5 seems > a reasonable step to do first. > > I realize that coaster testing, also, wont give good CPU utilization > until the current "low demand" problem is solved. > > - Mike > > > On 7/28/09 9:26 AM, Stuart Martin wrote: >> Hi Mike, >> Just following up on this. Will there be some swift use of GRAM5 >> on queen bee this week? >> -Stu >> On Jul 21, 2009, at Jul 21, 5:23 PM, Michael Wilde wrote: >>> Yes, there are a few we can run on QueenBee. >>> >>> Can try to test next week. >>> >>> Allan, we can test SEE/AMPL, OOPS, and PTMap there. >>> >>> - Mike >>> >>> >>> On 7/21/09 10:58 AM, Stuart Martin wrote: >>>> Are there any swift apps that can use queen bee? There is a >>>> GRAM5 service setup there for testing. >>>> -Stu >>>> Begin forwarded message: >>>>> From: Stuart Martin >>>>> Date: July 21, 2009 10:56:04 AM CDT >>>>> To: gateways at teragrid.org >>>>> Cc: Stuart Martin , Lukasz Lacinski >>>> > >>>>> Subject: Fwd: [gram-user] GRAM5 Alpha2 >>>>> >>>>> Hi Gateways, >>>>> >>>>> Any gateways that use (or can use) Queen Bee, it would be great >>>>> if you could target this new GRAM5 service that Lukasz >>>>> deployed. I heard from Lukasz that Jim has submitted a gateway >>>>> user (SAML) job and that went through fine and populate the gram >>>>> audit DB correctly. Thanks Jim! It would be nice to have some >>>>> gateway push the service to test scalability. >>>>> >>>>> Let us know if you plan to do this. >>>>> >>>>> Thanks, >>>>> Stu >>>>> >>>>> Begin forwarded message: >>>>> >>>>>> From: Lukasz Lacinski >>>>>> Date: July 21, 2009 1:18:05 AM CDT >>>>>> To: gram-user at lists.globus.org >>>>>> Subject: [gram-user] GRAM5 Alpha2 >>>>>> >>>>>> I've installed GRAM5 Alpha2 on Queen Bee. >>>>>> >>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork >>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs >>>>>> >>>>>> -seg-module pbs works fine. >>>>>> GRAM audit with PostgreSQL works fine. >>>>>> >>>>>> Can someone submit jobs as a gateway user? I'd like to check if >>>>>> the gateway_user field is written to our audit database. >>>>>> >>>>>> Thanks, >>>>>> Lukasz >>>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Tue Jul 28 16:06:48 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 28 Jul 2009 16:06:48 -0500 Subject: [Swift-devel] Re: coaster workers not receiving enough jobs In-Reply-To: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com> References: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com> Message-ID: <50b07b4b0907281406v238f95a0p52c58a65f98ef0af@mail.gmail.com> Now i tried changing 5mins to 5secs. based on the worker ids with the pull() calls it looks like all the jobs were successfully assigned to all workers throughout the blocks. 
[aespinosa at communicado coasters]$ grep pull coasters.log | grep -v Later | awk '{print $5}' | sort -u | cat -n | tail 65 0728-590322-000001:58 66 0728-590322-000001:59 67 0728-590322-000001:6 68 0728-590322-000001:60 69 0728-590322-000001:61 70 0728-590322-000001:62 71 0728-590322-000001:63 72 0728-590322-000001:7 73 0728-590322-000001:8 74 0728-590322-000001:9 My guess is that at long jobs (5 mins), pull() timeouts while waiting and will only get assigned much later on. But this doesn't happen because of some timeout mechanisms too (i think). 2009/7/23 Allan Espinosa : > I tried 0660-many.swift with 200 5min sleep jobs using local:local > mode (since queue on ranger and teraport takes a while to finish). > The session spawned 192 workers. ?Swift reports at most 36 active > processes at a time (which it finished successfully). ?After that > workers reach idle time exceptions. ? Logs and stuff are in > ~aespinosa/workflows/coaster_debug/run1/ > > sites.xml: > > > ? > ? ? > ? ? > ? ?>/home/aespinosa/workflows/coaster_debug/workdir > > ? ? ? ? ? ? ? ?10000 > ? ? ? ? ? ? ? ?1.98 > > ? ?1 > ? ?00:05:00 > ? ?3600 > ? > > > > swift session: > Swift svn swift-r3011 cog-r2439 > > RunID: locallog > Progress: > Progress: ?Selecting site:198 ?Initializing site shared directory:1 ?Stage in:1 > Progress: ?Selecting site:1 ?Submitting:198 ?Submitted:1 > Progress: ?Selecting site:1 ?Submitted:198 ?Active:1 > Progress: ?Selecting site:1 ?Submitted:192 ?Active:7 > Progress: ?Selecting site:1 ?Submitted:188 ?Active:11 > Progress: ?Selecting site:1 ?Submitted:181 ?Active:18 > Progress: ?Selecting site:1 ?Submitted:178 ?Active:21 > Progress: ?Selecting site:1 ?Submitted:163 ?Active:36 > Progress: ?Selecting site:1 ?Submitted:163 ?Active:36 > > > Progress: ?Selecting site:1 ?Submitted:163 ?Active:36 > Progress: ?Selecting site:1 ?Submitted:163 ?Active:36 > Progress: ?Selecting site:1 ?Submitted:163 ?Active:36 > Progress: ?Selecting site:1 ?Submitted:163 ?Active:36 > Progress: ?Selecting site:1 ?Submitted:163 ?Active:36 > Progress: ?Selecting site:1 ?Submitted:163 ?Active:36 > Progress: ?Selecting site:1 ?Submitted:163 ?Active:36 > Progress: ?Selecting site:1 ?Submitted:163 ?Active:36 > Progress: ?Selecting site:1 ?Submitted:163 ?Active:35 ?Checking status:1 > Progress: ?Submitted:156 ?Active:35 ?Checking status:1 ?Finished successfully:8 > Progress: ?Submitted:149 ?Active:34 ?Checking status:1 ?Finished successfully:16 > Progress: ?Submitted:144 ?Active:35 ?Checking status:1 ?Finished successfully:20 > Progress: ?Submitted:134 ?Active:30 ?Finished successfully:36 > Progress: ?Submitted:134 ?Active:30 ?Finished successfully:36 > Progress: ?Submitted:134 ?Active:30 ?Finished successfully:36 > Progress: ?Submitted:134 ?Active:30 ?Finished successfully:36 > Progress: ?Submitted:134 ?Active:30 ?Finished successfully:36 > Progress: ?Submitted:134 ?Active:30 ?Finished successfully:36 > Progress: ?Submitted:134 ?Active:30 ?Finished successfully:36 > Progress: ?Submitted:134 ?Active:30 ?Finished successfully:36 > Progress: ?Submitted:133 ?Active:31 ?Finished successfully:36 > Failed to transfer wrapper log from 066-many-locallog/info/0 on localhost > Failed to transfer wrapper log from 066-many-locallog/info/l on localhost > Failed to transfer wrapper log from 066-many-locallog/info/k on localhost > Failed to transfer wrapper log from 066-many-locallog/info/n on localhost > Failed to transfer wrapper log from 066-many-locallog/info/o on localhost > Failed to transfer wrapper log from 066-many-locallog/info/q on 
localhost > ailed to transfer wrapper log from 066-many-locallog/info/c on localhost > Failed to transfer wrapper log from 066-many-locallog/info/m on localhost > Failed to transfer wrapper log from 066-many-locallog/info/i on localhost > Failed to transfer wrapper log from 066-many-locallog/info/p on localhost > Failed to transfer wrapper log from 066-many-locallog/info/a on localhost > Progress: ?Stage in:11 ?Submitting:34 ?Submitted:113 ?Active:6 > Finished successfully:36 > Progress: ?Submitted:157 ?Active:7 ?Finished successfully:36 > Failed to transfer wrapper log from 066-many-locallog/info/t on localhost > Failed to transfer wrapper log from 066-many-locallog/info/u on localhost > Failed to transfer wrapper log from 066-many-locallog/info/v on localhost > Failed to transfer wrapper log from 066-many-locallog/info/x on localhost > Failed to transfer wrapper log from 066-many-locallog/info/r on localhost > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > Progress: ?Submitted:163 ?Active:1 ?Finished successfully:36 > ... > ... (not yet finished) > > $grep JOB_SUBMISSION coasters.log | grep Active | grep workerid | cat -n | tail > ? ?65 ?2009-07-23 11:08:10,065-0500 DEBUG TaskImpl > Task(type=JOB_SUBMISSION, > identity=urn:1248364974288-1248364979260-1248364979261) setting status > to Active workerid=000055 > ? ?66 ?2009-07-23 11:08:10,090-0500 DEBUG TaskImpl > Task(type=JOB_SUBMISSION, > identity=urn:1248364974280-1248364979248-1248364979249) setting status > to Active workerid=000051 > $ grep -a SUBMITJOB worker-0723-021156-00000* | grep Cmd | cat -n | tail > ? 61 ?worker-0723-021156-000001.log:1248365290 000054 < len=9, > actuallen=9, tag=1, flags=0, SUBMITJOB > ? ?62 ?worker-0723-021156-000001.log:1248365290 000050 < len=9, > actuallen=9, tag=1, flags=0, SUBMITJOB > ? ?63 ?worker-0723-021156-000001.log:1248365290 000053 < len=9, > actuallen=9, tag=1, flags=0, SUBMITJOB > ? ?64 ?worker-0723-021156-000001.log:1248365290 000052 < len=9, > actuallen=9, tag=1, flags=0, SUBMITJOB > ? ?65 ?worker-0723-021156-000001.log:1248365290 000051 < len=9, > actuallen=9, tag=1, flags=0, SUBMITJOB > ? ?66 ?worker-0723-021156-000001.log:1248365290 000055 < len=9, > actuallen=9, tag=1, flags=0, SUBMITJOB > > > it corresponds correctly with the swift session (more or less) since > we had 30+ completed jobs. 
> > Some lines in coasters.log i find intersting: > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:1248364974290-1248364979263-1248364979264) setting status > to Submitted > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:1248364974290-1248364979263-1248364979264) setting status > to Active > 2009-07-23 11:12:06,065-0500 INFO ?Command Sending Command(106, > JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1) > 2009-07-23 11:12:06,065-0500 INFO ?Command Command(106, JOBSTATUS) > CMD: Command(106, JOBSTATUS) > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:1248364974290-1248364979263-1248364979264) setting status > to Failed Block ta > sk failed: 0723-021156-000001Block task ended prematurely > > Statement unlikely to be reached at > /home/aespinosa/.globus/coasters/cscript15423.pl line 580. > ? ? ? ?(Maybe you meant system() when you said exec()?) > > > 2009-07-23 11:12:06,065-0500 INFO ?Command Sending Command(107, > JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1) > 2009-07-23 11:12:06,065-0500 INFO ?Command Command(107, JOBSTATUS) > CMD: Command(107, JOBSTATUS) > > > -Allan > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Tue Jul 28 16:12:22 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 28 Jul 2009 16:12:22 -0500 Subject: [Swift-devel] Re: coaster workers not receiving enough jobs In-Reply-To: <50b07b4b0907281406v238f95a0p52c58a65f98ef0af@mail.gmail.com> References: <50b07b4b0907230940i54b29c88hbf96a6774eae9b40@mail.gmail.com> <50b07b4b0907281406v238f95a0p52c58a65f98ef0af@mail.gmail.com> Message-ID: <1248815542.4106.2.camel@localhost> Right. I put in some quick code to prevent another spin. That likely has some bugs that causes this problem. On Tue, 2009-07-28 at 16:06 -0500, Allan Espinosa wrote: > Now i tried changing 5mins to 5secs. based on the worker ids with the > pull() calls it looks like all the jobs were successfully assigned to > all workers throughout the blocks. > > [aespinosa at communicado coasters]$ grep pull coasters.log | grep -v > Later | awk '{print $5}' | sort -u | cat -n | tail > 65 0728-590322-000001:58 > 66 0728-590322-000001:59 > 67 0728-590322-000001:6 > 68 0728-590322-000001:60 > 69 0728-590322-000001:61 > 70 0728-590322-000001:62 > 71 0728-590322-000001:63 > 72 0728-590322-000001:7 > 73 0728-590322-000001:8 > 74 0728-590322-000001:9 > > > My guess is that at long jobs (5 mins), pull() timeouts while waiting > and will only get assigned much later on. But this doesn't happen > because of some timeout mechanisms too (i think). > > 2009/7/23 Allan Espinosa : > > I tried 0660-many.swift with 200 5min sleep jobs using local:local > > mode (since queue on ranger and teraport takes a while to finish). > > The session spawned 192 workers. Swift reports at most 36 active > > processes at a time (which it finished successfully). After that > > workers reach idle time exceptions. 
Logs and stuff are in > > ~aespinosa/workflows/coaster_debug/run1/ > > > > sites.xml: > > > > > > > > > > > > >>/home/aespinosa/workflows/coaster_debug/workdir > > > > 10000 > > 1.98 > > > > 1 > > 00:05:00 > > 3600 > > > > > > > > > > swift session: > > Swift svn swift-r3011 cog-r2439 > > > > RunID: locallog > > Progress: > > Progress: Selecting site:198 Initializing site shared directory:1 Stage in:1 > > Progress: Selecting site:1 Submitting:198 Submitted:1 > > Progress: Selecting site:1 Submitted:198 Active:1 > > Progress: Selecting site:1 Submitted:192 Active:7 > > Progress: Selecting site:1 Submitted:188 Active:11 > > Progress: Selecting site:1 Submitted:181 Active:18 > > Progress: Selecting site:1 Submitted:178 Active:21 > > Progress: Selecting site:1 Submitted:163 Active:36 > > Progress: Selecting site:1 Submitted:163 Active:36 > > > > > > Progress: Selecting site:1 Submitted:163 Active:36 > > Progress: Selecting site:1 Submitted:163 Active:36 > > Progress: Selecting site:1 Submitted:163 Active:36 > > Progress: Selecting site:1 Submitted:163 Active:36 > > Progress: Selecting site:1 Submitted:163 Active:36 > > Progress: Selecting site:1 Submitted:163 Active:36 > > Progress: Selecting site:1 Submitted:163 Active:36 > > Progress: Selecting site:1 Submitted:163 Active:36 > > Progress: Selecting site:1 Submitted:163 Active:35 Checking status:1 > > Progress: Submitted:156 Active:35 Checking status:1 Finished successfully:8 > > Progress: Submitted:149 Active:34 Checking status:1 Finished successfully:16 > > Progress: Submitted:144 Active:35 Checking status:1 Finished successfully:20 > > Progress: Submitted:134 Active:30 Finished successfully:36 > > Progress: Submitted:134 Active:30 Finished successfully:36 > > Progress: Submitted:134 Active:30 Finished successfully:36 > > Progress: Submitted:134 Active:30 Finished successfully:36 > > Progress: Submitted:134 Active:30 Finished successfully:36 > > Progress: Submitted:134 Active:30 Finished successfully:36 > > Progress: Submitted:134 Active:30 Finished successfully:36 > > Progress: Submitted:134 Active:30 Finished successfully:36 > > Progress: Submitted:133 Active:31 Finished successfully:36 > > Failed to transfer wrapper log from 066-many-locallog/info/0 on localhost > > Failed to transfer wrapper log from 066-many-locallog/info/l on localhost > > Failed to transfer wrapper log from 066-many-locallog/info/k on localhost > > Failed to transfer wrapper log from 066-many-locallog/info/n on localhost > > Failed to transfer wrapper log from 066-many-locallog/info/o on localhost > > Failed to transfer wrapper log from 066-many-locallog/info/q on localhost > > ailed to transfer wrapper log from 066-many-locallog/info/c on localhost > > Failed to transfer wrapper log from 066-many-locallog/info/m on localhost > > Failed to transfer wrapper log from 066-many-locallog/info/i on localhost > > Failed to transfer wrapper log from 066-many-locallog/info/p on localhost > > Failed to transfer wrapper log from 066-many-locallog/info/a on localhost > > Progress: Stage in:11 Submitting:34 Submitted:113 Active:6 > > Finished successfully:36 > > Progress: Submitted:157 Active:7 Finished successfully:36 > > Failed to transfer wrapper log from 066-many-locallog/info/t on localhost > > Failed to transfer wrapper log from 066-many-locallog/info/u on localhost > > Failed to transfer wrapper log from 066-many-locallog/info/v on localhost > > Failed to transfer wrapper log from 066-many-locallog/info/x on localhost > > Failed to transfer wrapper log from 
066-many-locallog/info/r on localhost > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > Progress: Submitted:163 Active:1 Finished successfully:36 > > ... > > ... (not yet finished) > > > > $grep JOB_SUBMISSION coasters.log | grep Active | grep workerid | cat -n | tail > > 65 2009-07-23 11:08:10,065-0500 DEBUG TaskImpl > > Task(type=JOB_SUBMISSION, > > identity=urn:1248364974288-1248364979260-1248364979261) setting status > > to Active workerid=000055 > > 66 2009-07-23 11:08:10,090-0500 DEBUG TaskImpl > > Task(type=JOB_SUBMISSION, > > identity=urn:1248364974280-1248364979248-1248364979249) setting status > > to Active workerid=000051 > > $ grep -a SUBMITJOB worker-0723-021156-00000* | grep Cmd | cat -n | tail > > 61 worker-0723-021156-000001.log:1248365290 000054 < len=9, > > actuallen=9, tag=1, flags=0, SUBMITJOB > > 62 worker-0723-021156-000001.log:1248365290 000050 < len=9, > > actuallen=9, tag=1, flags=0, SUBMITJOB > > 63 worker-0723-021156-000001.log:1248365290 000053 < len=9, > > actuallen=9, tag=1, flags=0, SUBMITJOB > > 64 worker-0723-021156-000001.log:1248365290 000052 < len=9, > > actuallen=9, tag=1, flags=0, SUBMITJOB > > 65 worker-0723-021156-000001.log:1248365290 000051 < len=9, > > actuallen=9, tag=1, flags=0, SUBMITJOB > > 66 worker-0723-021156-000001.log:1248365290 000055 < len=9, > > actuallen=9, tag=1, flags=0, SUBMITJOB > > > > > > it corresponds correctly with the swift session (more or less) since > > we had 30+ completed jobs. > > > > Some lines in coasters.log i find intersting: > > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > > identity=urn:1248364974290-1248364979263-1248364979264) setting status > > to Submitted > > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > > identity=urn:1248364974290-1248364979263-1248364979264) setting status > > to Active > > 2009-07-23 11:12:06,065-0500 INFO Command Sending Command(106, > > JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1) > > 2009-07-23 11:12:06,065-0500 INFO Command Command(106, JOBSTATUS) > > CMD: Command(106, JOBSTATUS) > > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > > identity=urn:1248364974290-1248364979263-1248364979264) setting status > > to Failed Block ta > > sk failed: 0723-021156-000001Block task ended prematurely > > > > Statement unlikely to be reached at > > /home/aespinosa/.globus/coasters/cscript15423.pl line 580. > > (Maybe you meant system() when you said exec()?) 
> > > > > > 2009-07-23 11:12:06,065-0500 INFO Command Sending Command(107, > > JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1) > > 2009-07-23 11:12:06,065-0500 INFO Command Command(107, JOBSTATUS) > > CMD: Command(107, JOBSTATUS) > > > > > > -Allan > > > > > From wilde at mcs.anl.gov Tue Jul 28 19:26:15 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 28 Jul 2009 19:26:15 -0500 Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2 In-Reply-To: <6701C4A1-FC0D-4DA4-9972-2F9CEC8EF11D@mcs.anl.gov> References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov> <4A663FD8.3050909@mcs.anl.gov> <4A6F1FB5.2080207@mcs.anl.gov> <6701C4A1-FC0D-4DA4-9972-2F9CEC8EF11D@mcs.anl.gov> Message-ID: <4A6F9727.9050300@mcs.anl.gov> Stu, Glen Hocky has started testing a protein folding app called "OOPS" on QueenBee under GRAM5. Initial tiny sanity tests look good; we'll move on to running 100+ job runs, then larger. We needed to figure out how to get Swift to use all 8 cores of the QueenBee compute nodes, which we did. Now we can start scaling up. Glen hopes to test more there shortly. So far, no problems; no observed differences (in interface) with the new GRAM. Any chance of getting GRAM5 on the firefly host at UNL? - Mike On 7/28/09 11:17 AM, Stuart Martin wrote: > On Jul 28, 2009, at Jul 28, 10:56 AM, Michael Wilde wrote: > >> Allan Espinosa will try to test AMPL workflows for the SEE project >> there this week. >> >> I may try a few others time permitting, but likely not this week. >> >> Questions, Stu: >> - do you want testing through Condor-G with the grid_monitor as well >> as native? > > I'd say to use GRAM5 as is best for you/your users. We've done some > condor-g testing with and without the grid-monitor. We did with, just > for backward compatibility. But without is recommended. The > grid-monitor is no longer needed with GRAM5. > > So, if you have users that use condor-g, then submit GRAM5 jobs with > that. But, turn off using the grid-monitor. > http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_7:_gram5-condor-g > > > But if it is "better" to submit them natively, through cog API I > assume(?), then do that. > >> - for native testing of GRAM5 (ie through the plain pre-WS GRAM >> interface) are then any guidelines for how many jobs we can safely >> submit at once, or should we not worry about limits? (ie sending a few >> thousand jobs is OK?) > > Don't worry about it and submit away. We need to know the > limits/breaking points. > > But, to show what we've done in our testing, here are the results from > our 5 client tests (each running in a separate VM) hitting the same > GRAM5 service. > http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_4:_5-client-seg_2 > > http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_5:_5-client-seg-diffusers_2 > > > They submitted 5000 jobs over a 1 hour window to the same GRAM5 > service. The load on the head node never went above 4 on the first and > 7 on the second. > >> >> Allan: I just remembered that since Queenbee has 8-core hosts like >> Abe, coasters is the only reasonable approach for large-scale testing. >> But testing just a few AMPL jobs through plain GRAM5 seems a >> reasonable step to do first. >> >> I realize that coaster testing, also, wont give good CPU utilization >> until the current "low demand" problem is solved. >> >> - Mike >> >> >> On 7/28/09 9:26 AM, Stuart Martin wrote: >>> Hi Mike, >>> Just following up on this. Will there be some swift use of GRAM5 on >>> queen bee this week? 
>>> -Stu >>> On Jul 21, 2009, at Jul 21, 5:23 PM, Michael Wilde wrote: >>>> Yes, there are a few we can run on QueenBee. >>>> >>>> Can try to test next week. >>>> >>>> Allan, we can test SEE/AMPL, OOPS, and PTMap there. >>>> >>>> - Mike >>>> >>>> >>>> On 7/21/09 10:58 AM, Stuart Martin wrote: >>>>> Are there any swift apps that can use queen bee? There is a GRAM5 >>>>> service setup there for testing. >>>>> -Stu >>>>> Begin forwarded message: >>>>>> From: Stuart Martin >>>>>> Date: July 21, 2009 10:56:04 AM CDT >>>>>> To: gateways at teragrid.org >>>>>> Cc: Stuart Martin , Lukasz Lacinski >>>>>> >>>>>> Subject: Fwd: [gram-user] GRAM5 Alpha2 >>>>>> >>>>>> Hi Gateways, >>>>>> >>>>>> Any gateways that use (or can use) Queen Bee, it would be great if >>>>>> you could target this new GRAM5 service that Lukasz deployed. I >>>>>> heard from Lukasz that Jim has submitted a gateway user (SAML) job >>>>>> and that went through fine and populate the gram audit DB >>>>>> correctly. Thanks Jim! It would be nice to have some gateway >>>>>> push the service to test scalability. >>>>>> >>>>>> Let us know if you plan to do this. >>>>>> >>>>>> Thanks, >>>>>> Stu >>>>>> >>>>>> Begin forwarded message: >>>>>> >>>>>>> From: Lukasz Lacinski >>>>>>> Date: July 21, 2009 1:18:05 AM CDT >>>>>>> To: gram-user at lists.globus.org >>>>>>> Subject: [gram-user] GRAM5 Alpha2 >>>>>>> >>>>>>> I've installed GRAM5 Alpha2 on Queen Bee. >>>>>>> >>>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork >>>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs >>>>>>> >>>>>>> -seg-module pbs works fine. >>>>>>> GRAM audit with PostgreSQL works fine. >>>>>>> >>>>>>> Can someone submit jobs as a gateway user? I'd like to check if >>>>>>> the gateway_user field is written to our audit database. >>>>>>> >>>>>>> Thanks, >>>>>>> Lukasz >>>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Tue Jul 28 19:47:50 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 28 Jul 2009 19:47:50 -0500 Subject: [Swift-devel] Running on multicore hosts Message-ID: <4A6F9C36.4090209@mcs.anl.gov> Tibi, You should be able to do some preliminary tests of your econ app on QueenBee using GRAM5. The GRAM contact URIs Stu posted were: queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs To use all 8 cores of the hosts, turn on Swift clustering. Then edit libexec/_swiftseq to run all the jobs in a cluster in parallel rather than serially. 1) add an & to the line where the jobs are exec'ed: "$EXEC" "${ARGS[@]}" & 2) add a wait at the end of the script: done wait echo `date +%s` DONE >> $WRAPPERLOG Then turn on clustering. You need to do the math to get a fixed cluster size of NCPUs, 8 for QueenBee and Abe. 16 for Ranger. For oops we used: clustering.enabled=true clustering.min.time=480 clustering.queue.delay=15 with a GLOBUS::maxwalltime="00:01:00" This gave clusters of 480/60 = 8, and PBS walltimes of 8 minutes. To note: - the site maxwalltime was ignored; Swift calculated the PBS maxwalltime form the cluster size it built. - contrary to the user guide, Swift seemed to use clustering.min.time/(tc.data time) rather than (2*clustering.min.time)/(tc.data time) That needs investigation; it may be a matter of interpretation or may be describing a case where more jobs could enter the cluster queue before Swift has a chance to close the cluster. 
- When we are more sure this works, we can commit a reference file _swiftpar to the libexec directory. - at the moment the simple hack punts on per-job error code return with the cluster. The sequential cluster script passes on the error code of the first job in the cluster to fail, and aborts the rest of the cluster. The heck above treats the cluster as if all jobs succeeded. Im not sure if the per-job error codes make it back via _swiftwrap. if not, they could be made to. In any case, this is at the moment a temporary but simple hack to use sites with multicore nodes, while coasters is being debugged. It could readily be generalized though into straightforward direct support for multicore hosts over GRAM5, PBS, or Condor-G. - Mike From smartin at mcs.anl.gov Wed Jul 29 10:25:54 2009 From: smartin at mcs.anl.gov (Stuart Martin) Date: Wed, 29 Jul 2009 10:25:54 -0500 Subject: [Swift-devel] Fwd: [gram-user] GRAM5 Alpha2 In-Reply-To: <4A6F9727.9050300@mcs.anl.gov> References: <77AD1D14-008E-4FE8-A120-1D9E2C8C64E4@mcs.anl.gov> <4A663FD8.3050909@mcs.anl.gov> <4A6F1FB5.2080207@mcs.anl.gov> <6701C4A1-FC0D-4DA4-9972-2F9CEC8EF11D@mcs.anl.gov> <4A6F9727.9050300@mcs.anl.gov> Message-ID: <367326D5-0888-4F1E-B0CB-5979EB4A3B0E@mcs.anl.gov> On Jul 28, 2009, at Jul 28, 7:26 PM, Michael Wilde wrote: > Stu, > > Glen Hocky has started testing a protein folding app called "OOPS" > on QueenBee under GRAM5. Initial tiny sanity tests look good; we'll > move on to running 100+ job runs, then larger. > > We needed to figure out how to get Swift to use all 8 cores of the > QueenBee compute nodes, which we did. > > Now we can start scaling up. Glen hopes to test more there shortly. > > So far, no problems; no observed differences (in interface) with the > new GRAM. Cool. Let's see how things go as you ramp up. I want to keep track of GRAM5 application use cases and test results as I get them. I took a stab at what I think is happening for this OOPS application. I'm not sure if it is accurate, please take a look. http://dev.globus.org/wiki/GRAM/GRAM5#Application_Testing Then I'll need the details of one of the larger test runs that is done. > > Any chance of getting GRAM5 on the firefly host at UNL? Yea, I (we) can ask. But, maybe it makes sense to get some success with using gram5 on queen bee first and then we ask Brian and say look, we'd like/need gram5 installed there for testing? Looks like Firefly is running moab, so it would use the gram pbs adapter like queen bee. There is a CMS OSG effort going on now with Igor and Jeff Dost. But regardless, the more testing/deployments the better. > - Mike > > > On 7/28/09 11:17 AM, Stuart Martin wrote: >> On Jul 28, 2009, at Jul 28, 10:56 AM, Michael Wilde wrote: >>> Allan Espinosa will try to test AMPL workflows for the SEE project >>> there this week. >>> >>> I may try a few others time permitting, but likely not this week. >>> >>> Questions, Stu: >>> - do you want testing through Condor-G with the grid_monitor as >>> well as native? >> I'd say to use GRAM5 as is best for you/your users. We've done >> some condor-g testing with and without the grid-monitor. We did >> with, just for backward compatibility. But without is >> recommended. The grid-monitor is no longer needed with GRAM5. >> So, if you have users that use condor-g, then submit GRAM5 jobs >> with that. But, turn off using the grid-monitor. 
>> http://dev.globus.org/wiki/ >> GRAM5_Scalability_Results#Test_7:_gram5-condor-g But if it is >> "better" to submit them natively, through cog API I assume(?), then >> do that. >>> - for native testing of GRAM5 (ie through the plain pre-WS GRAM >>> interface) are then any guidelines for how many jobs we can safely >>> submit at once, or should we not worry about limits? (ie sending a >>> few thousand jobs is OK?) >> Don't worry about it and submit away. We need to know the limits/ >> breaking points. >> But, to show what we've done in our testing, here are the results >> from our 5 client tests (each running in a separate VM) hitting the >> same GRAM5 service. >> http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_4:_5- >> client-seg_2 http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_5 >> :_5-client-seg-diffusers_2 They submitted 5000 jobs over a 1 hour >> window to the same GRAM5 service. The load on the head node never >> went above 4 on the first and 7 on the second. >>> >>> Allan: I just remembered that since Queenbee has 8-core hosts like >>> Abe, coasters is the only reasonable approach for large-scale >>> testing. But testing just a few AMPL jobs through plain GRAM5 >>> seems a reasonable step to do first. >>> >>> I realize that coaster testing, also, wont give good CPU >>> utilization until the current "low demand" problem is solved. >>> >>> - Mike >>> >>> >>> On 7/28/09 9:26 AM, Stuart Martin wrote: >>>> Hi Mike, >>>> Just following up on this. Will there be some swift use of GRAM5 >>>> on queen bee this week? >>>> -Stu >>>> On Jul 21, 2009, at Jul 21, 5:23 PM, Michael Wilde wrote: >>>>> Yes, there are a few we can run on QueenBee. >>>>> >>>>> Can try to test next week. >>>>> >>>>> Allan, we can test SEE/AMPL, OOPS, and PTMap there. >>>>> >>>>> - Mike >>>>> >>>>> >>>>> On 7/21/09 10:58 AM, Stuart Martin wrote: >>>>>> Are there any swift apps that can use queen bee? There is a >>>>>> GRAM5 service setup there for testing. >>>>>> -Stu >>>>>> Begin forwarded message: >>>>>>> From: Stuart Martin >>>>>>> Date: July 21, 2009 10:56:04 AM CDT >>>>>>> To: gateways at teragrid.org >>>>>>> Cc: Stuart Martin , Lukasz Lacinski >>>>>> > >>>>>>> Subject: Fwd: [gram-user] GRAM5 Alpha2 >>>>>>> >>>>>>> Hi Gateways, >>>>>>> >>>>>>> Any gateways that use (or can use) Queen Bee, it would be >>>>>>> great if you could target this new GRAM5 service that Lukasz >>>>>>> deployed. I heard from Lukasz that Jim has submitted a >>>>>>> gateway user (SAML) job and that went through fine and >>>>>>> populate the gram audit DB correctly. Thanks Jim! It would >>>>>>> be nice to have some gateway push the service to test >>>>>>> scalability. >>>>>>> >>>>>>> Let us know if you plan to do this. >>>>>>> >>>>>>> Thanks, >>>>>>> Stu >>>>>>> >>>>>>> Begin forwarded message: >>>>>>> >>>>>>>> From: Lukasz Lacinski >>>>>>>> Date: July 21, 2009 1:18:05 AM CDT >>>>>>>> To: gram-user at lists.globus.org >>>>>>>> Subject: [gram-user] GRAM5 Alpha2 >>>>>>>> >>>>>>>> I've installed GRAM5 Alpha2 on Queen Bee. >>>>>>>> >>>>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork >>>>>>>> queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs >>>>>>>> >>>>>>>> -seg-module pbs works fine. >>>>>>>> GRAM audit with PostgreSQL works fine. >>>>>>>> >>>>>>>> Can someone submit jobs as a gateway user? I'd like to check >>>>>>>> if the gateway_user field is written to our audit database. 
>>>>>>>> >>>>>>>> Thanks, >>>>>>>> Lukasz >>>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From jamalphd at gmail.com Sun Jul 26 14:50:09 2009 From: jamalphd at gmail.com (J A) Date: Sun, 26 Jul 2009 19:50:09 -0000 Subject: [Swift-devel] XDTM Message-ID: Hi All: Can any one direct me to a source with more examples/explanation on how XDTM is working/implemented? Thanks, Jamal -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Mon Jul 27 11:27:51 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 27 Jul 2009 16:27:51 -0000 Subject: [Swift-devel] [provenance-challenge] FGCS Special Issue on Using the Open Provenance Model to Address Interoperability Challenges (fwd) Message-ID: This went to the provenance challenge list - maybe someone is interested. ---------- Forwarded message ---------- Date: Mon, 20 Jul 2009 20:06:01 +0000 From: Yogesh Simmhan Reply-To: provenance-challenge at ipaw.info To: "provenance-challenge at ipaw.info" Subject: [provenance-challenge] FGCS Special Issue on Using the Open Provenance Model to Address Interoperability Challenges This is the CfP for the special issue on OPM we discussed at the PC3 workshop. The special issue will appear in J. FGCS. Please consider submitting articles to the issue and also forward the CfP to groups/people who may be interested. PDF/TXT/HTML versions are attached. Regards, --Yogesh ________________________________________________________ Yogesh Simmhan/Post Doc Researcher/eScience Group/Microsoft Research EMail: yoges at microsoft.com WWW: research.microsoft.com/~yoges Office (LA): 1100 Glendon Ave/Suite 1080, Los Angeles CA 90024 Cell: +1 (540) 449-4770 SF Desk/Fax: +1 (425) 538-6245 -------------- next part -------------- A non-text attachment was scrubbed... Name: FGCS-OPM-CfP.PDF Type: application/pdf Size: 56343 bytes Desc: FGCS-OPM-CfP.PDF URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: FGCS-OPM-CfP.txt URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Tue Jul 28 08:35:19 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 28 Jul 2009 13:35:19 -0000 Subject: [Swift-devel] Re: More questions on Provenance In-Reply-To: <4A6DEEDF.6050603@purdue.edu> References: <4A6DEEDF.6050603@purdue.edu> Message-ID: Hi Tanu. I'm long gone. But here are a few brief comments. I added swift-devel. On Mon, 27 Jul 2009, Tanu Malik wrote: > 1. How do you model the provenance for across the network transfers? > In that case the input is some file, the process is the file transfer process > and the > output would be on another machine. The output will have to be created > manually > which either mentions the success of the transfer or failure. The level at which provenance is recorded is more abstract than that at the level where file transfers exist. A procedure takes input files which are described by URLs relative to the submit-side run directory and produces output files described by the same. The internal mechanisms of moving those files around to runtime sites as needed and managing the cache of those happens internally to the procedure execution and is not exposed as explicit activity. 
Information is logged abut such transfers though so if desired it might be possible to make another level of description about what happened there (one of the interesting things with ongoing OPM work is how to describe the same activity at multiple levels like this). > 2. Also you mention something about the number of runs in your > presentation. "extra records ? depth of graph x number of runs". What > does the number of runs correspond to and how is that modeled in the DB. This is about constructing an explicit transitive closure of the procedure/dataset graph. If you have an explicit graph A->B, B->C then constructing the closure means you ened to add A->C as an edge. Thats what I mean by roughly proportional to depth of graph - the deeper the graph, the more edges you need to add. In the most recent implementation, each invocation of Swift is a subgraph disconnected from the subgraphs of all other invocations of Swift. So (if you make the often invalid but also often valid assumption that each invocation of Swift generates roughly the same size provenance output), size of the graph put together is roughly proportional to the number of runs. If further work was done to identify datasets from the graphs of different runs (using some identity relation such as same filename or something else), then generating a tranistive closure would possibly generate graphs that are proportional to more-than-the-number-of-runs. > I was also wondering if we can chat on the phone or I come up again to > discuss a possible collaboration on this project and present some of our > new results. Nothing involving me except by very occasional email or if you hunt me down in person and ply me with alcohol. -- From tmalik at purdue.edu Tue Jul 28 11:12:07 2009 From: tmalik at purdue.edu (Tanu Malik) Date: Tue, 28 Jul 2009 16:12:07 -0000 Subject: [Swift-devel] Re: More questions on Provenance In-Reply-To: References: <4A6DEEDF.6050603@purdue.edu> Message-ID: <4A6F201F.3010205@purdue.edu> Thanks Ben, This is very helpful. I wish I could hunt you down. Interesting to know about the recent OPM work. We have defined network nodes in our model to explicitly demonstrate those. I did not know about OPM. Thanks Ben Clifford wrote: > Hi Tanu. I'm long gone. But here are a few brief comments. I added > swift-devel. > > On Mon, 27 Jul 2009, Tanu Malik wrote: > > >> 1. How do you model the provenance for across the network transfers? >> In that case the input is some file, the process is the file transfer process >> and the >> output would be on another machine. The output will have to be created >> manually >> which either mentions the success of the transfer or failure. >> > > The level at which provenance is recorded is more abstract than that at > the level where file transfers exist. A procedure takes input files which > are described by URLs relative to the submit-side run directory and > produces output files described by the same. > > The internal mechanisms of moving those files around to runtime sites as > needed and managing the cache of those happens internally to the procedure > execution and is not exposed as explicit activity. > > Information is logged abut such transfers though so if desired it might be > possible to make another level of description about what happened there > (one of the interesting things with ongoing OPM work is how to describe > the same activity at multiple levels like this). > > >> 2. Also you mention something about the number of runs in your >> presentation. "extra records ? 
depth of graph x number of runs". What >> does the number of runs correspond to and how is that modeled in the DB. >> > > This is about constructing an explicit transitive closure of the > procedure/dataset graph. > > If you have an explicit graph A->B, B->C then constructing the closure > means you ened to add A->C as an edge. Thats what I mean by roughly > proportional to depth of graph - the deeper the graph, the more edges you > need to add. > > In the most recent implementation, each invocation of Swift is a subgraph > disconnected from the subgraphs of all other invocations of Swift. So (if > you make the often invalid but also often valid assumption that each > invocation of Swift generates roughly the same size provenance output), > size of the graph put together is roughly proportional to the number of > runs. > > If further work was done to identify datasets from the graphs of different > runs (using some identity relation such as same filename or something > else), then generating a tranistive closure would possibly generate graphs > that are proportional to more-than-the-number-of-runs. > > >> I was also wondering if we can chat on the phone or I come up again to >> discuss a possible collaboration on this project and present some of our >> new results. >> > > Nothing involving me except by very occasional email or if you hunt me > down in person and ply me with alcohol. > > --