[Swift-user] Bag of Workstations connection timeout

Schenker, Thomas thomas.schenker at student.kit.edu
Fri Nov 2 10:18:20 CDT 2012


Hi,
I'm trying to run Swift on some workstations, following these instructions: http://www.ci.uchicago.edu/swift/guides/release-0.93/siteguide/siteguide.html#_bag_of_workstations

My coaster-service.conf looks like this:

# Location of SWIFT. If empty, PATH is searched
export SWIFT=/home/tschenker/swift-0.93/bin/swift

# Where to copy worker.pl on the remote machine for sites.xml
#export WORKER_LOCATION=$HOME
export WORKER_LOCATION=/home/ubuntu

# How to launch workers: local, ssh, cobalt, or futuregrid
export WORKER_MODE=ssh

# SSH hosts to start workers on (ssh mode only)
export WORKER_HOSTS="10.1.218.228 10.1.218.229"

# Do all the worker nodes you're using have a shared filesystem? (yes/no)
export SHARED_FILESYSTEM=no

# Username to use on worker nodes
export WORKER_USERNAME=ubuntu

# Enable SSH tunneling? (yes/no)
export SSH_TUNNELING=no

# Directory to keep log files, relative to working directory when launching start-coaster-service
export LOG_DIR=logs

# Manually define ports. If not specified, an available port will be used
export LOCAL_PORT=
export SERVICE_PORT=

# This is the IP address to which the workers will connect
# If not given, start-coaster-service tries to automatically detect
#   the IP address of this system via ifconfig
# Specify this if you have multiple network interfaces
export IPADDR=

# Location of the swift-vm-boot scripts
export SWIFTVMBOOT_DIR=$HOME/swift-vm-boot

# Swift information for creating sites.xml
export WORK=/home/tschenker/.work_swift
export QUEUE=prod-devel
export MAXTIME=20
export NODE=64

export JOBS_PER_NODE=1
export JOB_THROTTLE=0.019

start-coaster-service starts worker.pl on the workstations, but the worker processes die after about 30 seconds.

This is what I get when I run one of the Swift examples:

~$ swift -sites.file ~/sites.xml -tc.file ~/tc.data -config cf swift-0.93/examples/swift/tutorial/if.swift
Swift 0.93 swift-r5483 cog-r3339

RunID: 20121029-2128-urhpc6eg
Failed to acquire exclusive lock on log file.
Progress:  time: Mon, 29 Oct 2012 21:28:04 +0100
Find: http://10.1.167.72:57254
Find:  keepalive(120), reconnect - http://10.1.167.72:57254
Passive queue processor initialized. Callback URI is http://127.0.1.1:43284
Progress:  time: Mon, 29 Oct 2012 21:28:34 +0100  Submitted:1
Progress:  time: Mon, 29 Oct 2012 21:29:04 +0100  Submitted:1
Failed to connect:  Connection timed out at /home/ubuntu/worker.pl line 372.
Failed to connect:  Connection timed out at /home/ubuntu/worker.pl line 372.
Progress:  time: Mon, 29 Oct 2012 21:29:34 +0100  Submitted:1
Progress:  time: Mon, 29 Oct 2012 21:30:04 +0100  Submitted:1
...
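One way to narrow the timeout down is to test, from one of the workers, whether a plain TCP connection back to the coaster service can be opened at all. This is a minimal bash sketch. It assumes bash with /dev/tcp support and the coreutils `timeout` command on the workers. `check_port` is a hypothetical helper, not part of Swift, and HOST and PORT are placeholders for the address and port the workers were told to connect to.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Swift): check_port HOST PORT [SECONDS]
# Returns 0 if a TCP connection to HOST:PORT opens within SECONDS (default 5),
# and non-zero if it is refused, times out, or HOST cannot be resolved.
# Uses bash's built-in /dev/tcp redirection plus the coreutils `timeout` command.
check_port() {
  local host="$1" port="$2" secs="${3:-5}"
  # Inside the child bash, $0 is the host and $1 is the port.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example, run on a worker (e.g. as ubuntu@10.1.218.228); HOST/PORT are placeholders:
#   if check_port HOST PORT 10; then echo reachable; else echo "not reachable"; fi
```

If the check fails from a worker but succeeds on the machine running the service, the problem is more likely the address or network path the workers use than worker.pl itself.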


Can anybody help?


Thanks,
Thomas

