[Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!

Thu Jul 24 17:17:14 CDT 2008

Strange. It looks like the wrapper script never gets to execute on
UCANL.

Do you have the logs from the first run?

Is 

On Thu, 2008-07-24 at 16:50 -0500, skenny at uchicago.edu wrote:
> so we've had some odd behavior on a big run recently and
> are having some trouble figuring out exactly what's going on.
> it's also worth mentioning that we've had other successful
> runs with these settings on these same sites.
> 
> first, tried running on ncsa:
> 
> <!-- NCSAMERCURY @ grid-hg.ncsa.teragrid.org -->
>   <pool handle="NCSAMERCURY">
>     <profile namespace="karajan" key="initialScore">1</profile>
>     <profile namespace="karajan" key="jobThrottle">2</profile>
>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>     <execution provider="gt4" url="grid-hg.ncsa.teragrid.org" jobManager="PBS"/>
>     <workdirectory>/usr/projects/tg-community/SIDGrid/sidgrid_out/{username}</workdirectory>
>   </pool>
> 
> and then after failing/killing the run was resumed on ucanl64:
> 
> <!-- ANLUCTERAGRID64 @ tg-grid.uc.teragrid.org -->
>   <pool handle="ANLUCTERAGRID64" sysinfo="INTEL32::LINUX">
>     <profile namespace="karajan" key="initialScore">1</profile>
>     <profile namespace="karajan" key="jobThrottle">2</profile>
>     <profile namespace="globus" key="host_types">ia64-compute</profile>
>     <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" storage="/home/skenny/data" major="2" minor="4" patch="3"/>
>     <execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org" />
>     <workdirectory>/scratch/gpfs/local/sidgrid_out/{username}</workdirectory>
>   </pool>
> 
> the workflow appeared ok at first. however, we would then get
> some failures; the retries of the failed jobs that swift
> submitted appeared to work, but the failures were keeping the
> run from ramping up. eventually andric killed the run because
> there were so many errors and so few jobs running at once
> (though with no clear indication of why).
> 
> also, on ucanl, even when we kill the workflow the jobs not
> only remain in the queue but i can't kill them at all even
> when i own them (ti's looking into this i believe).
> 
> the log file is pretty long so rather than attach i've put
> everything from the run here on the ci network:
> /home/skenny/andric/permFriedman_run2
> 
> the individual jobs are given a 300-minute wallclock limit and
> generally take about an hour. when jobs fail and/or exceed
> wallclock on ucanl i get an email from the pbs scheduler. in
> this case i get the following:
> 
> PBS Job Id: 1759715.tg-master.uc.teragrid.org
> Job Name:   STDIN
> Exec host:  tg-c054/0
> Aborted by PBS Server 
> Job cannot be executed
> See Administrator for help
> 
> finally, our big ugly tc.data file can be seen here if that's
> of use:
> 
> https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SIDGrid/config/tc.data
> 
> sorry this email is so lengthy! just wanted to give you guys a
> full picture of what we're seeing. i'm open to any ideas, no
> matter how outlandish or hacky :) to try and get these running
> properly.
> 
> thanks!!
> sarah
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel