[Swift-user] swift on midway: apps and modules
Neil Best
nbest at ci.uchicago.edu
Fri Nov 30 13:08:55 CST 2012
It seems like everything is slowing down:
[nbest at midway-login1 narr]$ date;find data/grb2 -type f | wc -l
Thu Nov 29 21:30:34 CST 2012
8061
[nbest at midway-login1 narr]$ date;find data/grb2 -type f | wc -l
Thu Nov 29 21:48:01 CST 2012
12680
[nbest at midway-login1 narr]$ date;find data/grb2 -type f | wc -l
Thu Nov 29 22:01:36 CST 2012
16288
[nbest at midway-login1 narr]$ date;find data/grb2 -type f | wc -l
Fri Nov 30 10:38:56 CST 2012
19238
[nbest at midway-login1 narr]$ date;find data/grb2 -type f | wc -l
Fri Nov 30 12:58:04 CST 2012
19238
[nbest at midway-login1 narr]$ date;find data/nc -type f | wc -l
Fri Nov 30 12:58:45 CST 2012
18419
[nbest at midway-login1 narr]$ pwd
/project/joshuaelliott/narr
Too bad I didn't watch the nc/ directory as well, but both numbers
should be climbing to ~98k.
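(For next time, a quick loop along these lines would let me watch both directory counts at once; the paths and the 10-minute interval are just what I would use here:)

  #!/bin/bash
  # log a timestamped file count for both output trees every 10 minutes
  while true; do
      date
      printf 'grb2: %s\n' "$(find data/grb2 -type f | wc -l)"
      printf 'nc:   %s\n' "$(find data/nc -type f | wc -l)"
      sleep 600
  done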
The Swift process on the head node seems to have stalled:
Progress: time: Fri, 30 Nov 2012 04:24:02 +0000 Stage in:448 Submitted:9 Stage out:1206 Finished successfully:37070 Failed but can retry:1
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Progress ticker"
Exception in thread "PullThread" java.lang.OutOfMemoryError: Java heap space
Exception in thread "Timer-1"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Scheduler"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Timer-1"
Exception in thread "pool-1-thread-41" java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Progress: time: Fri, 30 Nov 2012 04:25:03 +0000 Stage in:447 Submitted:9 Stage out:1206 Finished successfully:37070 Failed but can retry:2
Progress: time: Fri, 30 Nov 2012 04:25:05 +0000 Stage in:446 Submitting:1 Submitted:9 Stage out:1206 Finished successfully:37070 Failed but can retry:2
Progress: time: Fri, 30 Nov 2012 04:25:07 +0000 Stage in:445 Submitting:2 Submitted:9 Stage out:1206 Finished successfully:37070 Failed but can retry:2
Progress: time: Fri, 30 Nov 2012 04:25:09 +0000 Stage in:444 Submitting:3 Submitted:9 Stage out:1206 Finished successfully:37070 Failed but can retry:2
No output since then. Do I need to restart it? What do you think is happening?
Scrolling back in the output I also see this:
Progress: time: Fri, 30 Nov 2012 04:19:15 +0000 Initializing:1 Stage in:446 Submitting:1 Stage out:1213 Finished successfully:37063 Failed but can retry:1
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid25355.hprof ...
Heap dump file created [1172799498 bytes in 10.727 secs]
Progress: time: Fri, 30 Nov 2012 04:19:35 +0000Exception in thread "PBS provider queue poller" Initializing:1 Stage in:446 Submitting:1 Stage out:1213 Finished successfully:37063 Failed but can retry:1
Progress: time: Fri, 30 Nov 2012 04:19:41 +0000 Initializing:1 Stage in:446 Submitted:1 Stage out:1213 Finished successfully:37063 Failed but can retry:1
java.lang.OutOfMemoryError: Java heap space
    at java.io.BufferedReader.<init>(BufferedReader.java:98)
    at java.io.BufferedReader.<init>(BufferedReader.java:109)
    at org.globus.cog.abstraction.impl.scheduler.slurm.QueuePoller.processStdout(QueuePoller.java:75)
    . . .
Exception in thread "Timer-2" Exception in thread "Overloaded Host
Monitor" java.lang.OutOfMemoryError: Java heap space
Progress: time: Fri, 30 Nov 2012 04:20:19 +0000 Initializing:1
Selecting site:1 Stage in:446 Submitted:1 Stage out:1213 Finished
successfully:37063
Failed but can retry:1
Fri, 30 Nov 2012 04:20:42 +0000 Initializing:1 Selecting site:1
Stage in:446 Submitted:1 Stage out:1213 Finished successfully:37063
Failed but can re
try:1
java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.newValueIterator(HashMap.java:856)
    at java.util.HashMap$Values.iterator(HashMap.java:923)
    at java.util.AbstractCollection.toArray(AbstractCollection.java:137)
    at java.util.ArrayList.addAll(ArrayList.java:530)
    at org.globus.cog.karajan.workflow.service.channels.ChannelContext.getActiveCommands(ChannelContext.java:171)
    at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:124)
    at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:118)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
Ominous.
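Is the immediate fix just a bigger JVM heap? My assumption (please correct me if this is the wrong knob) is that the swift launcher script honors SWIFT_HEAP_MAX when it builds the java command line, so I would relaunch roughly like this, with narr.swift standing in for my actual invocation:

  # assumption: SWIFT_HEAP_MAX sets -Xmx on the launcher's java command line;
  # narr.swift is a placeholder for my actual script and arguments
  export SWIFT_HEAP_MAX=4096M
  swift narr.swift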
If I do need to restart, are there special steps I should take to avoid
redoing the work that has already been done?
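If I understand the docs, each run writes a restart log (*.rlog) and -resume uses it to skip tasks that already finished, so my guess is something like the following, with placeholder file names:

  # placeholder names; assuming the *.rlog from the stalled run is what lets
  # -resume skip the work that already completed
  swift -resume narr-20121130-1234.0.rlog narr.swift

Is that the right mechanism here, or is there more to it?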
Originally I thought the slowdown was due to fair-share scheduling or to
increased activity on the cluster. I am not so sure about those theories
now, but can you point me to a primer on how to monitor the cluster? I am
new to Slurm; so far the RCC documentation seems to cover only the broadest
generalities, though maybe I have overlooked something (the basic commands
I have turned up so far are listed below). Thanks.
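These are the stock Slurm commands I have been poking at to check on my jobs and the fair-share theory; I assume they behave the same on Midway:

  squeue -u "$USER"               # my jobs and their current states
  sinfo                           # partition and node availability
  sshare -u "$USER"               # fair-share shares and usage
  sacct -u "$USER" -S 2012-11-29  # accounting records for recent jobs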
On Thu, Nov 29, 2012 at 10:02 PM, Neil Best <nbest at ci.uchicago.edu> wrote:
> On Thu, Nov 29, 2012 at 7:15 PM, David Kelly <davidk at ci.uchicago.edu> wrote:
>> file nctable<"nc_table">;
>
> It's cruising now, David. Thanks for the tip.