[Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
Mihael Hategan
hategan at mcs.anl.gov
Thu Jul 24 17:17:14 CDT 2008
Strange. It looks like the wrapper script never gets to execute on
UCANL.
Do you have the logs from the first run?
Is
On Thu, 2008-07-24 at 16:50 -0500, skenny at uchicago.edu wrote:
> so we've had some odd behavior on a big run recently and
> we're having some trouble figuring out exactly what's going on here.
> it's also worth mentioning that we've had other successful
> runs with these settings on these same sites.
>
> first, tried running on ncsa:
>
> <!-- NCSAMERCURY @ grid-hg.ncsa.teragrid.org -->
> <pool handle="NCSAMERCURY">
> <profile namespace="karajan" key="initialScore">1</profile>
> <profile namespace="karajan" key="jobThrottle">2</profile>
> <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
> <execution provider="gt4" url="grid-hg.ncsa.teragrid.org" jobManager="PBS"/>
>
> <workdirectory>/usr/projects/tg-community/SIDGrid/sidgrid_out/{username}</workdirectory>
> </pool>
>
> and then, after failing/being killed, the run was resumed on ucanl64:
>
> <!-- ANLUCTERAGRID64 @ tg-grid.uc.teragrid.org -->
> <pool handle="ANLUCTERAGRID64" sysinfo="INTEL32::LINUX">
> <profile namespace="karajan" key="initialScore">1</profile>
> <profile namespace="karajan" key="jobThrottle">2</profile>
> <profile namespace="globus" key="host_types">ia64-compute</profile>
> <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" storage="/home/skenny/data" major="2" minor="4" patch="3"/>
> <execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org" />
>
> <workdirectory>/scratch/gpfs/local/sidgrid_out/{username}</workdirectory>
> </pool>
>
> the workflow appeared ok at first. however, we would then get
> some failures; the retries of the failed jobs that swift
> submits appeared to work, but the failures were keeping the run
> from ramping up. eventually andric killed the run because
> there were so many errors and so few jobs running at once
> (though no clear indication of why).
>
> also, on ucanl, even when we kill the workflow the jobs not
> only remain in the queue but i can't kill them at all even
> when i own them (ti's looking into this i believe).
>
> the log file is pretty long so rather than attach i've put
> everything from the run here on the ci network:
> /home/skenny/andric/permFriedman_run2
>
> the individual jobs are given a 300min wallclock limit and
> generally take about an hour. finally, when jobs fail and/or
> exceed wallclock on ucanl i get an email from the pbs
> scheduler. in this case i get the following:
>
> PBS Job Id: 1759715.tg-master.uc.teragrid.org
> Job Name: STDIN
> Exec host: tg-c054/0
> Aborted by PBS Server
> Job cannot be executed
> See Administrator for help
>
> finally, our big ugly tc.data file can be seen here if that's
> of use:
>
> https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SIDGrid/config/tc.data
>
> sorry this email is so lengthy! just wanted to give you guys a
> full picture of what we're seeing. i'm open to any ideas, no
> matter how outlandish or hacky :) to try and get these running
> properly.
>
> thanks!!
> sarah
>