[Swift-user] swift on midway: apps and modules

Michael Wilde wilde at mcs.anl.gov
Fri Nov 30 13:44:59 CST 2012


> From: "Neil Best" <nbest at ci.uchicago.edu>
> Sent: Friday, November 30, 2012 1:08:55 PM

> It seems like everything is slowing down:
> The swift process on the head node seems to have stalled:
> ...
> Exception: java.lang.OutOfMemoryError thrown from the
> UncaughtExceptionHandler in thread "Progress ticker"
> Exception in thread "PullThread" java.lang.OutOfMemoryError: Java heap
> space

> No output since then. Do I need to restart it? What do you think is
> happening?

I think the swift command (i.e., the JVM that it runs in) is out of memory. You can give it a larger heap like this:

export SWIFT_HEAP_MAX=4096M # 4GB
swift -config etc etc

I will file a ticket to get this in the User Guide.

You might be able to reduce the memory usage by lowering the value of this Swift property to something not much greater than the number of concurrent cores you expect to get from the cluster:

foreach.max.threads=1024 # default

If you only expect, say, 20 nodes x 12 cores = 240 concurrent app() calls, reduce this setting to something like 400 or 500.
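
As a rough sketch of that arithmetic in shell: the node and core counts are the example figures from this message, while the 2x headroom factor and the variable names are only illustrative assumptions, not a Swift recommendation.

```shell
#!/bin/sh
# Illustrative sizing for foreach.max.threads (all values are assumptions).
nodes=20            # nodes you expect the cluster to give you
cores_per_node=12   # cores per node
concurrent=$((nodes * cores_per_node))   # expected concurrent app() calls

# Leave some headroom above the expected concurrency; 2x is an arbitrary choice.
threads=$((concurrent * 2))

# Print the property line so it can be pasted into your Swift properties file.
echo "foreach.max.threads=$threads"
```

With these numbers the script prints foreach.max.threads=480, which falls in the 400 to 500 range suggested above.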

> If I am restarting are there special steps that I need to take to
> avoid redoing work already done?

In the directory where you ran Swift, you should see a file ending in .rlog whose name contains the run ID of your latest run.

Re-issue the same swift command that you used to start the failed run, but add the argument:

-resume.file=runid.rlog
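
If several .rlog files have piled up in the run directory, something like the following can pick out the newest one. This is only a sketch: latest_rlog is a made-up helper name, and it assumes the file names contain no newlines.

```shell
#!/bin/sh
# Print the most recently modified .rlog file in a directory
# (defaults to the current directory). Hypothetical helper, not part of Swift.
latest_rlog() {
    ls -t "${1:-.}"/*.rlog 2>/dev/null | head -n 1
}

rlog=$(latest_rlog)
if [ -n "$rlog" ]; then
    echo "Resume log: $rlog"
else
    echo "No .rlog file found in $(pwd)" >&2
fi
```

Check that the file it reports really belongs to the run you want to resume before passing it to swift.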

This is described in the User Guide:

http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_restarts

If you are using the runswift command from our latest RCC tutorial, you'll need to edit that script to add both the heap variable and the resume flag.

> Originally I thought the slow-down was due to fair-share or increased
> activity on the cluster. Even though I am not so sure about those
> theories now, can you point me to a primer on how to monitor the
> cluster? I am new to Slurm. So far I have found the RCC documents to
> only cover the broadest generalities but maybe I have overlooked
> something. Thanks.

This Slurm command shows what you have queued and running:

squeue -u $USER -l

sinfo -l # shows the queues, which Slurm calls "partitions"

I sometimes do this to watch my jobs in a separate screen window:

watch -n 60 squeue -u $USER -l
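
For a more compact view, a small sketch like this tallies your jobs by state instead of listing each one. The count_states helper is a name invented for this example; the squeue call runs only if squeue is on your PATH.

```shell
#!/bin/sh
# Tally identical lines on stdin, e.g. one Slurm job state per line.
# Hypothetical helper name, used only for this example.
count_states() {
    sort | uniq -c
}

if command -v squeue >/dev/null 2>&1; then
    # -h drops the header line; -o %T prints only each job's state
    # (PENDING, RUNNING, ...).
    squeue -u "$USER" -h -o %T | count_states
fi
```

If the number of RUNNING jobs stays well below what you asked for while many remain PENDING, the slowdown is more likely on the cluster side than in Swift.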

- Mike


