[Swift-commit] r4970 - in trunk: bin/grid docs/siteguide

wilde at ci.uchicago.edu
Tue Aug 9 16:53:47 CDT 2011


Author: wilde
Date: 2011-08-09 16:53:47 -0500 (Tue, 09 Aug 2011)
New Revision: 4970

Added:
   trunk/bin/grid/run-worker.sh
Modified:
   trunk/bin/grid/TODO
   trunk/bin/grid/swift-workers
   trunk/docs/siteguide/grid
Log:
Snapshot of bin/grid - tested to 2M+ jobs with catsn. Adds the run-worker wrapper and swiftDemand file, and interim siteguide text.

Modified: trunk/bin/grid/TODO
===================================================================
--- trunk/bin/grid/TODO	2011-08-09 20:20:05 UTC (rev 4969)
+++ trunk/bin/grid/TODO	2011-08-09 21:53:47 UTC (rev 4970)
@@ -2,7 +2,21 @@
 
 - why are there two logs from the coaster service: uuid-named log and swift.log????
 
+- missing .error file???
 
+- new error handling strategy ?
+
+- unique sub file for each worker job
+--- log file for each
+--- worker log on $tmp
+--- cat worker log on error
+--- set worker run time
+--- option for per-site service
+--- option for gridftp if using per-site service
+--- return log file on error if possible
+
+
+
 EXTENCI APPLICATION WORK
 
 create modft install & test file; test under fork and work

Added: trunk/bin/grid/run-worker.sh
===================================================================
--- trunk/bin/grid/run-worker.sh	                        (rev 0)
+++ trunk/bin/grid/run-worker.sh	2011-08-09 21:53:47 UTC (rev 4970)
@@ -0,0 +1,33 @@
+#! /bin/bash
+
+contact=$1
+workername=$2
+origlogdir=$3
+echo OSG_WN_TMP=$OSG_WN_TMP
+if [ -z "$OSG_WN_TMP" ]; then
+  OSG_WN_TMP=/tmp
+fi
+mkdir -p $OSG_WN_TMP
+logdir=$(mktemp -d $OSG_WN_TMP/${workername}.workerdir.XXXXXX)
+nlines=1000
+
+echo "=== contact: $contact"
+echo "=== name:    $workername Running in dir $(pwd)"
+echo "=== cwd:     $(pwd)"
+echo "=== logdir:  $logdir"
+echo "==============================================="
+
+cat >worker.pl
+chmod +x worker.pl
+
+./worker.pl $contact $workername $logdir
+
+exitcode=$?
+
+echo "=== exit: worker.pl exited with code=$exitcode"
+
+echo "=== worker log - last $nlines lines:"
+
+echo
+
+tail -v -n $nlines $logdir/*


Property changes on: trunk/bin/grid/run-worker.sh
___________________________________________________________________
Added: svn:executable
   + *
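
For reference, run-worker.sh takes the coaster service contact URL, a
worker name, and a log directory argument, and reads the worker.pl
script itself from stdin (which is how Condor delivers it via the
submit file's Input setting). A minimal local smoke test might look
like the sketch below, assuming a coaster service is already listening
at the contact URL shown (the URL here is illustrative):

-----
# worker.pl is fed on stdin, just as Condor's Input transfer does
./run-worker.sh http://communicado.ci.uchicago.edu:36906 swork /tmp <worker.pl
-----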

Modified: trunk/bin/grid/swift-workers
===================================================================
--- trunk/bin/grid/swift-workers	2011-08-09 20:20:05 UTC (rev 4969)
+++ trunk/bin/grid/swift-workers	2011-08-09 21:53:47 UTC (rev 4970)
@@ -10,7 +10,7 @@
 require 'etc'
 
 class Site
-  attr_accessor :grid_resource, :data_dir, :app_dir, :name, :port
+  attr_accessor :grid_resource, :gridftp, :data_dir, :app_dir, :name, :port
   attr_reader :submit_file
 
 #      executable = <%= @app_dir %>/worker.pl  # FIXME (below)
@@ -22,35 +22,22 @@
 
 # WORKER_LOGGING_LEVEL=$LOGLEVEL $HOME/swift_gridtools/worker.pl $SERVICEURL swork${worker} $LOGDIR >& /dev/null &
 
-# a mod
-
   def gen_submit(count = 1)
-    job = %q[
-      universe = grid
-      stream_output = False
-      stream_error = False
-      transfer_executable = false
-      periodic_remove = JobStatus == 5
-      notification = Never
 
-      globus_rsl = (maxwalltime=240)
-      grid_resource = <%= @grid_resource %>
-      executable = /bin/sleep
-      arguments = 300
-      log = condor.log
-
-      <% count.times { %>queue
-      <% } %>
-    ]
-
     ov=$VERBOSE
     $VERBOSE=nil
     workerExecutable = `which worker.pl`
+    workerWrapper = `which run-worker.sh`
     $VERBOSE=ov
-#    workerContact = "http://communicado.ci.uchicago.edu:36906"
-     workerContact = ARGV[2]
+#   workerContact = "http://communicado.ci.uchicago.edu:36906"
+    workerContact = ARGV[2]
 
-    newjob = %Q[
+# for submit file:      log = condor.log
+#       <% count.times { %>queue
+#       <% } %>
+#      log     = condor/$(Process).log
+
+    job = %Q[
       universe = grid
       stream_output = False
       stream_error = False
@@ -60,23 +47,23 @@
 
       globus_rsl = (maxwalltime=240)
       grid_resource = <%= @grid_resource %>
-      executable = #{workerExecutable}
-      arguments = #{workerContact} swork /tmp
+      executable = #{workerWrapper}
+      arguments = #{workerContact} <%= @name.gsub(/__.*/,"") %> /tmp
       environment = WORKER_LOGGING_LEVEL=INFO
-      log = condor.log
+      Input   = #{workerExecutable}
+      Error   = condor/$(Process).err
+      Output  = condor/$(Process).out
+      log     = condor.log
 
-      <% count.times { %>queue
-      <% } %>
+      queue #{count}
     ]
-
-    ERB.new(newjob.gsub(/^\s+/, ""), 0, "%<>", "@submit_file").result(binding)
+    ERB.new(job.gsub(/^\s+/, ""), 0, "%<>", "@submit_file").result(binding)
   end
 
   def submit_job(count)
     puts "submit_job: Submitting #{@name} #{count} jobs"
     count = count.to_i
     output = ""
-#return output
     submitfile = gen_submit(count)
     IO.popen("condor_submit", "w+") do |submit|
       submit.puts submitfile
@@ -144,9 +131,9 @@
   demandThread = Thread.new("monitor-demand") do |t|
     puts "starting demand thread"
     while true do
-    puts "in demand thread"
-      # swiftDemand = IO.read("swiftDemand")  # Replace this with sensor of Swift demand
-      swiftDemand = 15
+      puts "in demand thread"
+      swiftDemand = IO.read("swiftDemand").to_i  # Replace this with sensor of Swift demand
+      # swiftDemand = 15
       paddedDemand = (swiftDemand * 1.2).to_i
       ov=$VERBOSE;$VERBOSE=nil
       totalRunning = `condor_q #{$username} -const 'JobStatus == 2' -format \"%s \" GlobalJobId`.split(" ").size
@@ -161,6 +148,7 @@
     site               = Site.new
     site.name          = name
     site.grid_resource = "gt2 #{value.url}/jobmanager-#{value.jm}"
+    site.gridftp       = "gsiftp://#{value.url}"
     site.app_dir       = value.app_dir
     site.data_dir      = value.data_dir
     site.port          = start_port + ctr
@@ -201,9 +189,9 @@
             site.submit_job(targetQueued - queued)
           end
           trip += 1
-          sleep 60
           # puts "#{name}: #{total}"
         end
+      sleep 60
       end
     end
 

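As a reference for what submit_job pipes into condor_submit after ERB
rendering, here is a rough sketch wrapped in a shell here-document (the
grid_resource value, file paths, contact URL, site name, and queue
count are all illustrative, not actual values):

-----
# 'EOF' is quoted so the shell does not expand $(Process)
condor_submit <<'EOF'
universe = grid
stream_output = False
stream_error = False
transfer_executable = false
periodic_remove = JobStatus == 5
notification = Never
globus_rsl = (maxwalltime=240)
grid_resource = gt2 gatekeeper.example.edu/jobmanager-condor
executable = /path/to/run-worker.sh
arguments = http://communicado.ci.uchicago.edu:36906 SITENAME /tmp
environment = WORKER_LOGGING_LEVEL=INFO
Input   = /path/to/worker.pl
Error   = condor/$(Process).err
Output  = condor/$(Process).out
log     = condor.log
queue 10
EOF
-----
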
Modified: trunk/docs/siteguide/grid
===================================================================
--- trunk/docs/siteguide/grid	2011-08-09 20:20:05 UTC (rev 4969)
+++ trunk/docs/siteguide/grid	2011-08-09 21:53:47 UTC (rev 4970)
@@ -4,7 +4,8 @@
 Overview of running on grid sites
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-* Get a DOEGrids cert. Then register it in the *OSG Engage VO*, and/or map it using +gx-request+ on TeraGrid sites.
+* Get a DOEGrids cert. Then register it in the *OSG Engage VO*, and/or
+map it using +gx-request+ on TeraGrid sites.
 
 * Run +GridSetup+ to configure Swift to use the grid sites.  This tests
 for correct operation and creates a "green list" of good sites.
@@ -12,12 +13,22 @@
 * Prepare an installation package for the programs you want to run on
 grid sites via Swift, and install that package using +foreachsite+.
 
-* Run +RunWorkers+ to start and maintain a pool of Swift workers on
+* Run +swift-workers+ to start and maintain a pool of Swift workers on
 each site.
 
 * Run Swift scripts that use the grid site resources.
 
+NOTE: This revision only supports a single-entry sites file which uses
+provider staging and assumes that the necessary apps are locatable
+through the same tc entries (i.e., either absolute or PATH-relative
+paths) on all sites.
 
+NOTE: This revision has been tested using the bin/grid code from
+trunk (which gets installed into trunk's bin/ directory) and the base
+Swift code from the 0.93 branch.  No other configurations have been
+tested at the moment.  I intend to put this code in bin/grid in 0.93,
+as it should have no ill effects on other Swift usage.
+
 Requesting Access
 ~~~~~~~~~~~~~~~~~
 
@@ -106,20 +117,85 @@
 get_greensites >greensites
 -----
 
+You can repeatedly run the get_greensites command, which simply
+concatenates all the site names that successfully returned an output
+file from the site tests.
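+
+For example, to check progress (assuming one site name per line in the
+greensites file):
+
+-----
+get_greensites >greensites
+wc -l greensites
+-----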
 
+Starting a single coaster service
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This single coaster service will serve all grid sites:
+
+-----
+start-grid-service --loglevel INFO --throttle 3.99 --jobspernode 1 \
+                   >& start-grid-service.out
+-----
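+
+The swift-workers invocation shown later reads the service's worker
+port back from a file; to verify the port after startup (the
+service-0.wport file name is taken from that later example):
+
+-----
+cat service-0.wport
+-----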
+
+Starting workers on OSG sites
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Make sure that your "greensites" file is in the current working
+directory.
+
+The swiftDemand file should be set to contain the number of workers
+you want to start across all OSG sites. Eventually this will be set
+dynamically by watching your Swift script execute.  (Note that this
+number only includes jobs started by the swift-workers factory
+command, not by any workers added manually from the TeraGrid - see
+below.)
+
+The condor directory must be pre-created and will be used by Condor to
+return stdout and stderr files from the Condor jobs, which will
+execute the wrapper script "run-worker.sh".
+
+NOTE: this script is currently built manually, and wraps and
+transports the worker.pl script. This needs to be automated.
+
+-----
+echo 250 >swiftDemand
+mkdir -p condor
+
+swift-workers greensites extenci \
+              http://communicado.ci.uchicago.edu:$(cat service-0.wport) \
+              >& swift-workers.out &
+-----
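+
+Because the factory's demand thread re-reads the swiftDemand file on
+every pass, you can resize the worker pool while it runs, for example:
+
+-----
+echo 500 >swiftDemand
+-----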
+
+Adding workers from TeraGrid sites
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The command below can be used to submit jobs to TeraGrid (Ranger only
+at the moment) to add more workers to the execution pool. The same
+requirements hold there as for OSG sites, namely, that the app tools
+listed in tc for the single execution site need to be locatable on the
+TeraGrid site(s).
+
+-----
+start-ranger-service --nodes 1 --walltime 00:10:00 --project TG-DBS123456N \
+                     --queue development --user tg12345 --startservice no \
+                     >& start-ranger-service.out 
+-----
+
+NOTE: Change the project and user names to match your TeraGrid
+parameters.
+
 Running Swift
 ~~~~~~~~~~~~~
 Now that everything is in place, run Swift with the following command:
 
 -----
-swift -sites.file sites.xml -tc.file tc.data catsn.swift -n=10
+export SWIFT_HEAP_MAX=6000m # Add this for very large scripts
+
+swift -config cf.ps -tc.file tc -sites.file sites.grid-ps.xml \
+      catsn.swift -n=10000 >& swift.out &
 -----
 
-You should see several new files being created, called catsn.0001.out, catsn.0002.out, etc. Each of these
-files should contain the contents of what you placed into data.txt. If this happens, your job has run
-successfully on PADS!
+You should see several new files being created, called catsn.0001.out,
+catsn.0002.out, etc. Each of these files should contain the contents
+of what you placed into data.txt. If this happens, your job has run
+successfully on the grid sites.
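+
+One simple way to watch progress is to count output files as they
+arrive and follow the Swift log:
+
+-----
+ls catsn.*.out 2>/dev/null | wc -l
+tail -f swift.out
+-----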
 
 More Help
 ~~~~~~~~~
-The best place for additional help is the Swift user mailing list. You can subscribe to this list at
-http://mail.ci.uchicago.edu/mailman/listinfo/swift-user. When submitting information, please send your sites.xml file, your tc.data, and any Swift log files that were created during your attempt.
+
+The best place for additional help is the Swift user mailing list. You
+can subscribe to this list at
+http://mail.ci.uchicago.edu/mailman/listinfo/swift-user. When
+submitting information, please send your sites.xml file, your tc.data,
+and any Swift log files that were created during your attempt.



