[Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
Mihael Hategan
hategan at mcs.anl.gov
Thu Jul 24 17:30:12 CDT 2008
On Thu, 2008-07-24 at 17:21 -0500, skenny at uchicago.edu wrote:
> hmm, i think that's the only log we have from this most recent
> run. however, we saw the same behavior on another run to ncsa
> the week before. the log is here:
>
> /home/skenny/andric/permFriedman_logs/permFriedman1001-20080702-2300-3r702ylc.log
Not the same. In this case it seems like WS-GRAM requests are getting a
connection reset. I'm not really sure what could cause that, but it's
somewhere at the TCP level.
Anyway, can you run manual jobs on UCANL?
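A quick way to separate Swift from the service is to exercise the same WS-GRAM path by hand. A sketch, assuming the stock GT4 client tools and the default container port 8443 (adjust the factory contact if the UCANL container runs elsewhere; the hostname is taken from your pool entry):

```shell
# First check that the TCP path to the GT4 container is up at all;
# a reset already at connect time points below GRAM.
telnet tg-grid.uc.teragrid.org 8443

# Then submit a trivial job through the same ManagedJobFactoryService
# and PBS factory type that Swift uses, streaming stdout back.
globusrun-ws -submit -streaming \
    -F tg-grid.uc.teragrid.org \
    -Ft PBS \
    -c /bin/hostname
```

If the manual submission gets the same connection reset, the problem is in the container or the network rather than in Swift.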
>
> ---- Original message ----
> >Date: Thu, 24 Jul 2008 17:17:14 -0500
> >From: Mihael Hategan <hategan at mcs.anl.gov>
> >Subject: Re: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
> >To: skenny at uchicago.edu
> >Cc: swift-devel at ci.uchicago.edu, andric <mjandric at gmail.com>
> >
> >Strange. It looks like the wrapper script never gets to execute on
> >UCANL.
> >
> >Do you have the logs from the first run?
> >
> >Is
> >
> >On Thu, 2008-07-24 at 16:50 -0500, skenny at uchicago.edu wrote:
> >> so we've had some odd behavior on a big run recently and
> >> we're having some trouble figuring out exactly what's going on.
> >> it's also worth mentioning that we've had other successful
> >> runs with these settings on these same sites.
> >>
> >> first, tried running on ncsa:
> >>
> >> <!-- NCSAMERCURY @ grid-hg.ncsa.teragrid.org -->
> >> <pool handle="NCSAMERCURY">
> >> <profile namespace="karajan" key="initialScore">1</profile>
> >> <profile namespace="karajan" key="jobThrottle">2</profile>
> >> <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
> >> <execution provider="gt4" url="grid-hg.ncsa.teragrid.org"
> >> jobManager="PBS"/>
> >> <workdirectory>/usr/projects/tg-community/SIDGrid/sidgrid_out/{username}</workdirectory>
> >> </pool>
> >>
> >> and then, after failing/killing it, the run was resumed on ucanl64:
> >>
> >> <!-- ANLUCTERAGRID64 @ tg-grid.uc.teragrid.org -->
> >> <pool handle="ANLUCTERAGRID64" sysinfo="INTEL32::LINUX">
> >> <profile namespace="karajan" key="initialScore">1</profile>
> >> <profile namespace="karajan" key="jobThrottle">2</profile>
> >> <profile namespace="globus"
> >> key="host_types">ia64-compute</profile>
> >> <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org"
> >> storage="/home/skenny/data" major="2" minor="4" patch="3"/>
> >> <execution provider="gt4" jobmanager="PBS"
> >> url="tg-grid.uc.teragrid.org" />
> >> <workdirectory>/scratch/gpfs/local/sidgrid_out/{username}</workdirectory>
> >> </pool>
> >>
> >> the workflow appears ok at first. however we would then get
> >> some failures; the retries of the failed jobs that swift
> >> submits appeared to work but the failures were keeping the run
> >> from ramping up. eventually andric killed the run because
> >> there were so many errors and so few jobs running at once
> >> (though no clear indication of why).
> >>
> >> also, on ucanl, even when we kill the workflow the jobs not
> >> only remain in the queue but i can't kill them at all even
> >> when i own them (ti's looking into this i believe).
> >>
> >> the log file is pretty long so rather than attach i've put
> >> everything from the run here on the ci network:
> >> /home/skenny/andric/permFriedman_run2
> >>
> >> the individual jobs are given a 300min wallclock limit and
> >> generally take about an hour. when jobs fail and/or
> >> exceed the wallclock limit on ucanl i get an email from the pbs
> >> scheduler. in this case i get the following:
> >>
> >> PBS Job Id: 1759715.tg-master.uc.teragrid.org
> >> Job Name: STDIN
> >> Exec host: tg-c054/0
> >> Aborted by PBS Server
> >> Job cannot be executed
> >> See Administrator for help
> >>
> >> finally, our big ugly tc.data file can be seen here if that's
> >> of use:
> >>
> >> https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SIDGrid/config/tc.data
> >>
> >> sorry this email is so lengthy! just wanted to give you guys a
> >> full picture of what we're seeing. i'm open to any ideas, no
> >> matter how outlandish or hacky :) to try and get these running
> >> properly.
> >>
> >> thanks!!
> >> sarah
> >>
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
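For the UCANL jobs that stay in the queue after the workflow is killed, it may be worth confirming from tg-login what PBS itself reports before escalating to the admins. A sketch using standard PBS commands (the job id below is the one from the scheduler mail; a forced purge of a job whose execution host is unresponsive generally needs operator privileges, which would explain why a plain qdel does nothing):

```shell
qstat -u $USER    # list your jobs and their states on tg-master
qdel 1759715      # normal delete; silently ineffective if the node's MOM is down
```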