[Swift-user] swift on midway: apps and modules
Neil Best
nbest at ci.uchicago.edu
Fri Nov 30 13:08:55 CST 2012
It seems like everything is slowing down:
[nbest at midway-login1 narr]$ date;find data/grb2 -type f | wc -l
Thu Nov 29 21:30:34 CST 2012
8061
[nbest at midway-login1 narr]$ date;find data/grb2 -type f | wc -l
Thu Nov 29 21:48:01 CST 2012
12680
[nbest at midway-login1 narr]$ date;find data/grb2 -type f | wc -l
Thu Nov 29 22:01:36 CST 2012
16288
[nbest at midway-login1 narr]$ date;find data/grb2 -type f | wc -l
Fri Nov 30 10:38:56 CST 2012
19238
[nbest at midway-login1 narr]$ date;find data/grb2 -type f | wc -l
Fri Nov 30 12:58:04 CST 2012
19238
[nbest at midway-login1 narr]$ date;find data/nc -type f | wc -l
Fri Nov 30 12:58:45 CST 2012
18419
[nbest at midway-login1 narr]$ pwd
/project/joshuaelliott/narr
Too bad I didn't watch the nc/ directory as well, but both numbers
should be climbing to ~98k.
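(For next time, a quick loop along these lines would let me watch both directory counts at once; the paths and the 10-minute interval are just what I would use here:)

  #!/bin/bash
  # log a timestamped file count for both output trees every 10 minutes
  while true; do
      date
      printf 'grb2: %s\n' "$(find data/grb2 -type f | wc -l)"
      printf 'nc:   %s\n' "$(find data/nc -type f | wc -l)"
      sleep 600
  done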
The Swift process on the head node seems to have stalled:
Progress: time: Fri, 30 Nov 2012 04:24:02 +0000 Stage in:448 Submitted:9 Stage out:1206 Finished successfully:37070 Failed but can retry:1
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Progress ticker"
Exception in thread "PullThread" java.lang.OutOfMemoryError: Java heap space
Exception in thread "Timer-1"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Scheduler"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Timer-1"
Exception in thread "pool-1-thread-41" java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Progress: time: Fri, 30 Nov 2012 04:25:03 +0000 Stage in:447 Submitted:9 Stage out:1206 Finished successfully:37070 Failed but can retry:2
Progress: time: Fri, 30 Nov 2012 04:25:05 +0000 Stage in:446 Submitting:1 Submitted:9 Stage out:1206 Finished successfully:37070 Failed but can retry:2
Progress: time: Fri, 30 Nov 2012 04:25:07 +0000 Stage in:445 Submitting:2 Submitted:9 Stage out:1206 Finished successfully:37070 Failed but can retry:2
Progress: time: Fri, 30 Nov 2012 04:25:09 +0000 Stage in:444 Submitting:3 Submitted:9 Stage out:1206 Finished successfully:37070 Failed but can retry:2
No output since then. Do I need to restart it? What do you think is happening?
Scrolling back in the output I also see this:
Progress: time: Fri, 30 Nov 2012 04:19:15 +0000 Initializing:1 Stage in:446 Submitting:1 Stage out:1213 Finished successfully:37063 Failed but can retry:1
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid25355.hprof ...
Heap dump file created [1172799498 bytes in 10.727 secs]
Progress: time: Fri, 30 Nov 2012 04:19:35 +0000Exception in thread "PBS provider queue poller" Initializing:1 Stage in:446 Submitting:1 Stage out:1213 Finished successfully:37063 Failed but can retry:1
Progress: time: Fri, 30 Nov 2012 04:19:41 +0000 Initializing:1 Stage in:446 Submitted:1 Stage out:1213 Finished successfully:37063 Failed but can retry:1
java.lang.OutOfMemoryError: Java heap space
    at java.io.BufferedReader.<init>(BufferedReader.java:98)
    at java.io.BufferedReader.<init>(BufferedReader.java:109)
    at org.globus.cog.abstraction.impl.scheduler.slurm.QueuePoller.processStdout(QueuePoller.java:75)
    . . .
Exception in thread "Timer-2" Exception in thread "Overloaded Host
Monitor" java.lang.OutOfMemoryError: Java heap space
Progress: time: Fri, 30 Nov 2012 04:20:19 +0000 Initializing:1
Selecting site:1 Stage in:446 Submitted:1 Stage out:1213 Finished
successfully:37063
Failed but can retry:1
Fri, 30 Nov 2012 04:20:42 +0000 Initializing:1 Selecting site:1
Stage in:446 Submitted:1 Stage out:1213 Finished successfully:37063
Failed but can re
try:1
java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.newValueIterator(HashMap.java:856)
    at java.util.HashMap$Values.iterator(HashMap.java:923)
    at java.util.AbstractCollection.toArray(AbstractCollection.java:137)
    at java.util.ArrayList.addAll(ArrayList.java:530)
    at org.globus.cog.karajan.workflow.service.channels.ChannelContext.getActiveCommands(ChannelContext.java:171)
    at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:124)
    at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:118)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
Ominous.
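Is the immediate fix just a bigger JVM heap? My assumption (please correct me if this is the wrong knob) is that the swift launcher script honors SWIFT_HEAP_MAX when it builds the java command line, so I would relaunch roughly like this, with narr.swift standing in for my actual invocation:

  # assumption: SWIFT_HEAP_MAX sets -Xmx on the launcher's java command line;
  # narr.swift is a placeholder for my actual script and arguments
  export SWIFT_HEAP_MAX=4096M
  swift narr.swift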
If I do need to restart, are there special steps I should take to avoid
redoing the work that has already been done?
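If I understand the docs, each run writes a restart log (*.rlog) and -resume uses it to skip tasks that already finished, so my guess is something like the following, with placeholder file names:

  # placeholder names; assuming the *.rlog from the stalled run is what lets
  # -resume skip the work that already completed
  swift -resume narr-20121130-1234.0.rlog narr.swift

Is that the right mechanism here, or is there more to it?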
Originally I thought the slowdown was due to fair-share scheduling or to
increased activity on the cluster. I am not so sure about those theories
now, but can you point me to a primer on how to monitor the cluster? I am
new to Slurm; so far the RCC documentation seems to cover only the broadest
generalities, though maybe I have overlooked something (the basic commands
I have turned up so far are listed below). Thanks.
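These are the stock Slurm commands I have been poking at to check on my jobs and the fair-share theory; I assume they behave the same on Midway:

  squeue -u "$USER"               # my jobs and their current states
  sinfo                           # partition and node availability
  sshare -u "$USER"               # fair-share shares and usage
  sacct -u "$USER" -S 2012-11-29  # accounting records for recent jobs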
On Thu, Nov 29, 2012 at 10:02 PM, Neil Best <nbest at ci.uchicago.edu> wrote:
> On Thu, Nov 29, 2012 at 7:15 PM, David Kelly <davidk at ci.uchicago.edu> wrote:
>> file nctable<"nc_table">;
>
> It's cruising now, David. Thanks for the tip.