[Swift-user] Large Job not starting

Lorenzo Pesce lpesce at uchicago.edu
Fri Jan 11 08:59:53 CST 2013


Hi,

We are working on a project that involves about 3 million tasks. We have completed about 1.5 million of them and are now resuming the job.

I have been seeing this for a while:

Progress:  time: Thu, 10 Jan 2013 20:39:20 +0000  Selecting site:63831  Submitted:7171  Finished in previous run:1486037
...
Progress:  time: Fri, 11 Jan 2013 14:50:21 +0000  Selecting site:63831  Submitted:7171  Finished in previous run:1486037
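
To put a number on how long the counters have been frozen, here is a minimal sketch that parses Progress lines and reports the time span over which the latest counts stayed identical. It assumes the lines keep exactly the format shown above; the names parse_progress and stalled_for are hypothetical helpers for illustration, not part of Swift.

```python
import re
from datetime import timedelta
from email.utils import parsedate_to_datetime

# Assumed layout: "Progress:  time: <RFC 2822 date>  <name>:<count>  <name>:<count> ..."
PROGRESS_RE = re.compile(
    r"^Progress:\s+time:\s+(?P<time>.+?\s[+-]\d{4})\s{2,}(?P<rest>.*)$"
)
COUNT_RE = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")


def parse_progress(line):
    """Return (timestamp, {state: count}) for a Progress line, else None."""
    match = PROGRESS_RE.match(line)
    if match is None:
        return None
    when = parsedate_to_datetime(match.group("time"))
    counts = {
        name.strip(): int(value)
        for name, value in COUNT_RE.findall(match.group("rest"))
    }
    return when, counts


def stalled_for(lines):
    """Return how long the most recent counts have stayed unchanged."""
    entries = [p for p in map(parse_progress, lines) if p is not None]
    if not entries:
        return None
    last_time, last_counts = entries[-1]
    since = last_time
    # Walk backwards until the counts differ from the latest ones.
    for when, counts in reversed(entries):
        if counts != last_counts:
            break
        since = when
    return last_time - since


# The two Progress lines quoted in this message.
SAMPLE = [
    "Progress:  time: Thu, 10 Jan 2013 20:39:20 +0000  Selecting site:63831  "
    "Submitted:7171  Finished in previous run:1486037",
    "...",
    "Progress:  time: Fri, 11 Jan 2013 14:50:21 +0000  Selecting site:63831  "
    "Submitted:7171  Finished in previous run:1486037",
]

if __name__ == "__main__":
    print(stalled_for(SAMPLE))  # prints 18:11:01
```

For the two lines quoted above, this gives a little over 18 hours with no change in any counter.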

From the ps command (output truncated):
lpesce   28172 28102 19 Jan10 pts/4    04:20:32 java -Xmx12072M -XX:+HeapDumpOnOutOfMemoryError -Djava.endorsed.dirs=/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/endorsed -DUID=1978 -DGLOBUS_HOSTNAME=login5.beagle.ci.uchicago.edu -DCOG_INSTALL_PATH=/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/.. -Dswift.home=/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/.. -Duser.home=/lustre/beagle/lpesce -Djava.security.egd=file:///dev/urandom -XX:+UseParallelGC -XX:ParallelGCThreads=2 -classpath /home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../etc:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../libexec:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/ant.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/antlr-2.7.5.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/castor-0.9.6.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/coaster-bootstrap.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-abstraction-common-2.4.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-grapheditor-0.47.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-jglobus-1.7.0.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-karajan-0.36-dev.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-coaster-0.3.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-dcache-0.1.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt2-2.4.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-local-2.2.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-localscheduler-0.4.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-ssh-2.4.jar:/home/davidk
/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-webdav-2.1.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-resources-1.0.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-swift-svn.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cog-util-0.92.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/commons-httpclient.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/commons-logging-1.1.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cryptix32.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cryptix-asn1.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/cryptix.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/j2ssh-common-0.2.2.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/j2ssh-core-0.2.2-patch-b.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/jakarta-regexp-1.2.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/jakarta-slide-webdavlib-2.0.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/jaxrpc.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/jce-jdk13-131.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/jgss.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/jline-0.9.94.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/jsr173_1.0_api.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/jug-lgpl-2.0.0.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/junit.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/log4j-1.2.16.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/puretls.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/resolver.j
ar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../lib/stringtemplate.jar:/home/davidk/swift-trunk/cog/modules/swift/dist/swift


lpesce at login5:/lustre/beagle/GCNet/RG/Oreo/o080522_BS1> ps v 28172
  PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
28172 pts/4    Sl+  260:32     84     2 12868101 11612816 70.2 java -Xmx12072M -XX:+HeapDumpOnOutOfMemoryError 

The Swift process seems to be using zero CPU at this time.

It has no jobs of its own in the queue:

lpesce at login5:/lustre/beagle/GCNet/RG/Oreo/o080522_BS1> qstat -u lpesce

sdb: 
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
1919801.sdb          lpesce   advanced B0109-030545-00    6144   --   --    --  117:4 R 45:54
1919802.sdb          lpesce   advanced B0109-030545-00    2868   --   --    --  117:4 R 45:53
1919806.sdb          lpesce   advanced B0109-080540-00   27222   --   --    --  117:3 R 45:49
1919807.sdb          lpesce   advanced B0109-080540-00    6609   --   --    --  117:3 R 45:49
1919808.sdb          lpesce   advanced B0109-080540-00    3328   --   --    --  117:3 R 45:48

(These are unrelated jobs, which have been running for more than a day.)

Suggestions?

Thanks,

Lorenzo



