[Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!

Thu Jul 24 16:50:06 CDT 2008

so we've had some odd behavior on a big run recently and
we're having some trouble figuring out exactly what's going on.
it's also worth mentioning that we've had other successful
runs with these settings on these same sites.  

first, we tried running on ncsa:

<!-- NCSAMERCURY @ grid-hg.ncsa.teragrid.org -->
  <pool handle="NCSAMERCURY">
    <profile namespace="karajan" key="initialScore">1</profile>
    <profile namespace="karajan" key="jobThrottle">2</profile>
    <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
    <execution provider="gt4" url="grid-hg.ncsa.teragrid.org" jobManager="PBS"/>
    <workdirectory>/usr/projects/tg-community/SIDGrid/sidgrid_out/{username}</workdirectory>
  </pool>

and then, after it was failing and got killed, the run was resumed on ucanl64:

<!-- ANLUCTERAGRID64 @ tg-grid.uc.teragrid.org -->
  <pool handle="ANLUCTERAGRID64" sysinfo="INTEL32::LINUX">
    <profile namespace="karajan" key="initialScore">1</profile>
    <profile namespace="karajan" key="jobThrottle">2</profile>
    <profile namespace="globus" key="host_types">ia64-compute</profile>
    <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" storage="/home/skenny/data" major="2" minor="4" patch="3"/>
    <execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org" />
    <workdirectory>/scratch/gpfs/local/sidgrid_out/{username}</workdirectory>
  </pool>
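
for anyone comparing the two entries side by side, a small script along these lines can list where they differ. this is only a sketch: the `<config>` wrapper element and the helper names `pool_summary` and `diff_keys` are made up for illustration, the entries are trimmed copies of the ones above, and it says nothing about whether a given difference actually matters to swift.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two pool entries quoted above, wrapped in a
# hypothetical <config> root so they parse as a single document.
SITES_XML = """
<config>
  <pool handle="NCSAMERCURY">
    <profile namespace="karajan" key="initialScore">1</profile>
    <profile namespace="karajan" key="jobThrottle">2</profile>
    <execution provider="gt4" url="grid-hg.ncsa.teragrid.org" jobManager="PBS"/>
  </pool>
  <pool handle="ANLUCTERAGRID64" sysinfo="INTEL32::LINUX">
    <profile namespace="karajan" key="initialScore">1</profile>
    <profile namespace="karajan" key="jobThrottle">2</profile>
    <profile namespace="globus" key="host_types">ia64-compute</profile>
    <execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org"/>
  </pool>
</config>
"""


def pool_summary(xml_text):
    """Return {handle: {"profiles": {...}, "execution": {...}}} for each pool."""
    root = ET.fromstring(xml_text)
    summary = {}
    for pool in root.iter("pool"):
        # Key each profile by (namespace, key) so entries can be compared directly.
        profiles = {
            (p.get("namespace"), p.get("key")): (p.text or "").strip()
            for p in pool.findall("profile")
        }
        execution = pool.find("execution")
        summary[pool.get("handle")] = {
            "profiles": profiles,
            "execution": dict(execution.attrib) if execution is not None else {},
        }
    return summary


def diff_keys(a, b):
    """Keys present in exactly one of the two dicts, sorted for stable output."""
    return sorted(set(a) ^ set(b))


if __name__ == "__main__":
    pools = pool_summary(SITES_XML)
    ncsa, ucanl = pools["NCSAMERCURY"], pools["ANLUCTERAGRID64"]
    print("execution attrs on only one site:",
          diff_keys(ncsa["execution"], ucanl["execution"]))
    print("profiles on only one site:",
          diff_keys(ncsa["profiles"], ucanl["profiles"]))
```

on the entries above this reports the differently cased `jobManager`/`jobmanager` attribute and the extra `host_types` profile on the ucanl side; both sites share the same `initialScore` and `jobThrottle` values.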

the workflow appeared ok at first. however, we would then get
some failures. the retries that swift submitted for the failed
jobs appeared to work, but the failures were keeping the run
from ramping up. eventually andric killed the run because
there were so many errors and so few jobs running at once
(though with no clear indication of why).

also, on ucanl, even when we kill the workflow, the jobs not
only remain in the queue, but i can't kill them at all even
though i own them (ti's looking into this i believe).

the log file is pretty long so rather than attach i've put
everything from the run here on the ci network:
/home/skenny/andric/permFriedman_run2

the individual jobs are given a 300-minute wallclock limit and
generally take about an hour. also, when jobs fail and/or
exceed wallclock on ucanl i get an email from the pbs
scheduler. in this case i got the following:

PBS Job Id: 1759715.tg-master.uc.teragrid.org
Job Name:   STDIN
Exec host:  tg-c054/0
Aborted by PBS Server 
Job cannot be executed
See Administrator for help

finally, our big ugly tc.data file can be seen here if that's
of use:

https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SIDGrid/config/tc.data

sorry this email is so lengthy! just wanted to give you guys a
full picture of what we're seeing. i'm open to any ideas, no
matter how outlandish or hacky :) to try and get these running
properly.

thanks!!
sarah