[Swift-commit] r4970 - in trunk: bin/grid docs/siteguide
wilde at ci.uchicago.edu
Tue Aug 9 16:53:47 CDT 2011
Author: wilde
Date: 2011-08-09 16:53:47 -0500 (Tue, 09 Aug 2011)
New Revision: 4970
Added:
trunk/bin/grid/run-worker.sh
Modified:
trunk/bin/grid/TODO
trunk/bin/grid/swift-workers
trunk/docs/siteguide/grid
Log:
Snapshot of bin/grid - tested to 2M+ jobs with catsn. Adds run-worker wrapper and swiftDemand file, and interim siteguide text.
Modified: trunk/bin/grid/TODO
===================================================================
--- trunk/bin/grid/TODO 2011-08-09 20:20:05 UTC (rev 4969)
+++ trunk/bin/grid/TODO 2011-08-09 21:53:47 UTC (rev 4970)
@@ -2,7 +2,21 @@
- why are there two logs from the coaster service: uuid-named log and swift.log????
+- missing .error file???
+- new error handling strategy ?
+
+- unique sub file for each worker job
+--- log file for each
+--- worker log on $tmp
+--- cat worker log on error
+--- set worker run time
+--- option for per-site service
+--- option for gridftp if using per-site service
+--- return log file on error if possible
+
+
+
EXTENCI APPLICATION WORK
create modft install & test file; test under fork and work
Added: trunk/bin/grid/run-worker.sh
===================================================================
--- trunk/bin/grid/run-worker.sh (rev 0)
+++ trunk/bin/grid/run-worker.sh 2011-08-09 21:53:47 UTC (rev 4970)
@@ -0,0 +1,33 @@
+#! /bin/bash
+
+contact=$1
+workername=$2
+origlogdir=$3
+echo OSG_WN_TMP=$OSG_WN_TMP
+if [ -z "$OSG_WN_TMP" ]; then
+ OSG_WN_TMP=/tmp
+fi
+mkdir -p $OSG_WN_TMP
+logdir=$(mktemp -d $OSG_WN_TMP/${workername}.workerdir.XXXXXX)
+nlines=1000
+
+echo "=== contact: $contact"
+echo "=== name: $workername Running in dir $(pwd)"
+echo "=== cwd: $(pwd)"
+echo "=== logdir: $logdir"
+echo "==============================================="
+
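+# worker.pl is streamed to this job's stdin (the Condor submit file
+# passes it via its "Input" directive); save it locally and make it
+# executable before running it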
+cat >worker.pl
+chmod +x worker.pl
+
+./worker.pl $contact $workername $logdir
+
+exitcode=$?
+
+echo "=== exit: worker.pl exited with code=$exitcode"
+
+echo "=== worker log - last $nlines lines:"
+
+echo
+
+tail -v -n $nlines $logdir/*
Property changes on: trunk/bin/grid/run-worker.sh
___________________________________________________________________
Added: svn:executable
+ *
Modified: trunk/bin/grid/swift-workers
===================================================================
--- trunk/bin/grid/swift-workers 2011-08-09 20:20:05 UTC (rev 4969)
+++ trunk/bin/grid/swift-workers 2011-08-09 21:53:47 UTC (rev 4970)
@@ -10,7 +10,7 @@
require 'etc'
class Site
- attr_accessor :grid_resource, :data_dir, :app_dir, :name, :port
+ attr_accessor :grid_resource, :gridftp, :data_dir, :app_dir, :name, :port
attr_reader :submit_file
# executable = <%= @app_dir %>/worker.pl # FIXME (below)
@@ -22,35 +22,22 @@
# WORKER_LOGGING_LEVEL=$LOGLEVEL $HOME/swift_gridtools/worker.pl $SERVICEURL swork${worker} $LOGDIR >& /dev/null &
-# a mod
-
def gen_submit(count = 1)
- job = %q[
- universe = grid
- stream_output = False
- stream_error = False
- transfer_executable = false
- periodic_remove = JobStatus == 5
- notification = Never
- globus_rsl = (maxwalltime=240)
- grid_resource = <%= @grid_resource %>
- executable = /bin/sleep
- arguments = 300
- log = condor.log
-
- <% count.times { %>queue
- <% } %>
- ]
-
ov=$VERBOSE
$VERBOSE=nil
workerExecutable = `which worker.pl`
+ workerWrapper = `which run-worker.sh`
$VERBOSE=ov
-# workerContact = "http://communicado.ci.uchicago.edu:36906"
- workerContact = ARGV[2]
+# workerContact = "http://communicado.ci.uchicago.edu:36906"
+ workerContact = ARGV[2]
- newjob = %Q[
+# for submit file: log = condor.log
+# <% count.times { %>queue
+# <% } %>
+# log = condor/$(Process).log
+
+ job = %Q[
universe = grid
stream_output = False
stream_error = False
@@ -60,23 +47,23 @@
globus_rsl = (maxwalltime=240)
grid_resource = <%= @grid_resource %>
- executable = #{workerExecutable}
- arguments = #{workerContact} swork /tmp
+ executable = #{workerWrapper}
+ arguments = #{workerContact} <%= @name.gsub(/__.*/,"") %> /tmp
environment = WORKER_LOGGING_LEVEL=INFO
- log = condor.log
+ Input = #{workerExecutable}
+ Error = condor/$(Process).err
+ Output = condor/$(Process).out
+ log = condor.log
- <% count.times { %>queue
- <% } %>
+ queue #{count}
]
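+    # Note: "queue N" queues N identical jobs in one statement,
+    # replacing the ERB count.times loop used previously.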
-
- ERB.new(newjob.gsub(/^\s+/, ""), 0, "%<>", "@submit_file").result(binding)
+ ERB.new(job.gsub(/^\s+/, ""), 0, "%<>", "@submit_file").result(binding)
end
def submit_job(count)
puts "submit_job: Submitting #{@name} #{count} jobs"
count = count.to_i
output = ""
-#return output
submitfile = gen_submit(count)
IO.popen("condor_submit", "w+") do |submit|
submit.puts submitfile
@@ -144,9 +131,9 @@
demandThread = Thread.new("monitor-demand") do |t|
puts "starting demand thread"
while true do
- puts "in demand thread"
- # swiftDemand = IO.read("swiftDemand") # Replace this with sensor of Swift demand
- swiftDemand = 15
+ puts "in demand thread"
+ swiftDemand = IO.read("swiftDemand").to_i # Replace this with sensor of Swift demand
+ # swiftDemand = 15
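+      # pad the observed demand by 20% so spare workers are available
+      # as jobs finish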
paddedDemand = (swiftDemand * 1.2).to_i
ov=$VERBOSE;$VERBOSE=nil
totalRunning = `condor_q #{$username} -const 'JobStatus == 2' -format \"%s \" GlobalJobId`.split(" ").size
@@ -161,6 +148,7 @@
site = Site.new
site.name = name
site.grid_resource = "gt2 #{value.url}/jobmanager-#{value.jm}"
+ site.gridftp = "gsiftp://#{value.url}"
site.app_dir = value.app_dir
site.data_dir = value.data_dir
site.port = start_port + ctr
@@ -201,9 +189,9 @@
site.submit_job(targetQueued - queued)
end
trip += 1
- sleep 60
# puts "#{name}: #{total}"
end
+ sleep 60
end
end
Modified: trunk/docs/siteguide/grid
===================================================================
--- trunk/docs/siteguide/grid 2011-08-09 20:20:05 UTC (rev 4969)
+++ trunk/docs/siteguide/grid 2011-08-09 21:53:47 UTC (rev 4970)
@@ -4,7 +4,8 @@
Overview of running on grid sites
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-* Get a DOEGrids cert. Then register it in the *OSG Engage VO*, and/or map it using +gx-request+ on TeraGrid sites.
+* Get a DOEGrids cert. Then register it in the *OSG Engage VO*, and/or
+map it using +gx-request+ on TeraGrid sites.
* Run +GridSetup+ to configure Swift to use the grid sites. This tests
for correct operation and creates a "green list" of good sites.
@@ -12,12 +13,22 @@
* Prepare an installation package for the programs you want to run on
grid sites via Swift, and install that package using +foreachsite+.
-* Run +RunWorkers+ to start and maintain a pool of Swift workers on
+* Run +swift-workers+ to start and maintain a pool of Swift workers on
each site.
* Run Swift scripts that use the grid site resources.
+NOTE: This revision only supports a single-entry sites file which uses
+provider staging, and assumes that the necessary apps are locatable
+through the same tc entries (i.e. either absolute or PATH-relative
+paths) on all sites. A sketch of such a sites file is shown below.
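+
+A minimal sketch of what such a single-entry sites file might look
+like (the element values are illustrative; PORT is the port recorded
+in service-0.wport):
+
+-----
+<config>
+  <pool handle="grid">
+    <execution provider="coaster-persistent"
+               url="http://communicado.ci.uchicago.edu:PORT"
+               jobmanager="local:local"/>
+    <profile namespace="swift" key="stagingMethod">proxy</profile>
+    <workdirectory>/tmp/swift.workdir</workdirectory>
+  </pool>
+</config>
+-----
+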
+NOTE: This revision has been tested using the bin/grid code from
+trunk (which gets installed into trunk's bin/ directory) and the base
+Swift code from the 0.93 branch. No other configurations have been
+tested at the moment. I intend to put this code in bin/grid in 0.93,
+as it should have no ill effects on other Swift usage.
+
Requesting Access
~~~~~~~~~~~~~~~~~
@@ -106,20 +117,85 @@
get_greensites >greensites
-----
+You can rerun the get_greensites command at any time; it simply
+concatenates all the site names that successfully returned an output
+file from the site tests.
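+
+For example, the resulting greensites file is simply a list of site
+names, one per line (the names below are illustrative):
+
+-----
+RENCI_Engagement
+UCHC_CBG
+Firefly
+-----
+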
+Starting a single coaster service
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This single coaster service will serve all grid sites:
+
+-----
+start-grid-service --loglevel INFO --throttle 3.99 --jobspernode 1 \
+  >& start-grid-service.out
+-----
+
+Starting workers on OSG sites
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Make sure that your "greensites" file is in the current working
+directory.
+
+The swiftDemand file should be set to contain the number of workers
+you want to start across all OSG sites. Eventually this will be set
+dynamically by watching your Swift script execute. (Note that this
+number only includes workers started by the swift-workers factory
+command, not any workers added manually from the TeraGrid - see
+below.)
+
+The condor directory must be pre-created; Condor uses it to return
+the stdout and stderr files from the Condor jobs, which execute the
+wrapper script "run-worker.sh".
+
+NOTE: this wrapper is currently built manually; it wraps worker.pl and
+transports it to the site (via the submit file's Input directive).
+Building it should be automated.
+
+-----
+echo 250 >swiftDemand
+mkdir -p condor
+
+swift-workers greensites extenci \
+ http://communicado.ci.uchicago.edu:$(cat service-0.wport) \
+ >& swift-workers.out &
+-----
+
+Adding workers from TeraGrid sites
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The command below can be used to submit jobs to TeraGrid (Ranger only
+at the moment) to add more workers to the execution pool. The same
+requirements hold there as for OSG sites, namely that the app tools
+listed in tc for the single execution site must be locatable on the
+TeraGrid site(s).
+
+-----
+start-ranger-service --nodes 1 --walltime 00:10:00 --project TG-DBS123456N \
+ --queue development --user tg12345 --startservice no \
+ >& start-ranger-service.out
+-----
+
+NOTE: Change the project and user names to match your TeraGrid
+parameters.
+
Running Swift
~~~~~~~~~~~~~
Now that everything is in place, run Swift with the following command:
-----
-swift -sites.file sites.xml -tc.file tc.data catsn.swift -n=10
+export SWIFT_HEAP_MAX=6000m # Add this for very large scripts
+
+swift -config cf.ps -tc.file tc -sites.file sites.grid-ps.xml \
+ catsn.swift -n=10000 >& swift.out &
-----
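+
+The cf.ps file referenced above is a Swift properties file that turns
+on provider staging; a minimal sketch, assuming standard 0.93 property
+names (exact settings depend on your run):
+
+-----
+use.provider.staging=true
+status.mode=provider
+lazy.errors=false
+execution.retries=0
+-----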
-You should see several new files being created, called catsn.0001.out, catsn.0002.out, etc. Each of these
-files should contain the contents of what you placed into data.txt. If this happens, your job has run
-successfully on PADS!
+You should see several new files being created, called catsn.0001.out,
+catsn.0002.out, etc. Each of these files should contain the contents
+of what you placed into data.txt. If this happens, your job has run
+successfully on the grid sites.
More Help
~~~~~~~~~
-The best place for additional help is the Swift user mailing list. You can subscribe to this list at
-http://mail.ci.uchicago.edu/mailman/listinfo/swift-user. When submitting information, please send your sites.xml file, your tc.data, and any Swift log files that were created during your attempt.
+
+The best place for additional help is the Swift user mailing list. You
+can subscribe to this list at
+http://mail.ci.uchicago.edu/mailman/listinfo/swift-user. When
+submitting information, please send your sites.xml file, your tc.data,
+and any Swift log files that were created during your attempt.