[Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
Mihael Hategan
hategan at mcs.anl.gov
Thu Jul 24 17:30:12 CDT 2008
On Thu, 2008-07-24 at 17:21 -0500, skenny at uchicago.edu wrote:
> hmm, i think that's the only log we have from this most recent
> run. however, we saw the same behavior on another run to ncsa
> the week before. the log is here:
>
> /home/skenny/andric/permFriedman_logs/permFriedman1001-20080702-2300-3r702ylc.log
Not the same. In this case it seems like WS-GRAM requests are getting a
connection reset. I'm not really sure what could cause that, but it's
somewhere at the TCP level.
Anyway, can you run manual jobs on UCANL?
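A quick way to separate Swift from the service is to exercise the same WS-GRAM path by hand. A sketch, assuming the stock GT4 client tools and the default container port 8443 (adjust the factory contact if the UCANL container runs elsewhere; the hostname is taken from your pool entry):

```shell
# First check that the TCP path to the GT4 container is up at all;
# a reset already at connect time points below GRAM.
telnet tg-grid.uc.teragrid.org 8443

# Then submit a trivial job through the same ManagedJobFactoryService
# and PBS factory type that Swift uses, streaming stdout back.
globusrun-ws -submit -streaming \
    -F tg-grid.uc.teragrid.org \
    -Ft PBS \
    -c /bin/hostname
```

If the manual submission gets the same connection reset, the problem is in the container or the network rather than in Swift.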
>
> ---- Original message ----
> >Date: Thu, 24 Jul 2008 17:17:14 -0500
> >From: Mihael Hategan <hategan at mcs.anl.gov>
> >Subject: Re: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
> >To: skenny at uchicago.edu
> >Cc: swift-devel at ci.uchicago.edu, andric <mjandric at gmail.com>
> >
> >Strange. It looks like the wrapper script never gets to execute on
> >UCANL.
> >
> >Do you have the logs from the first run?
> >
> >Is
> >
> >On Thu, 2008-07-24 at 16:50 -0500, skenny at uchicago.edu wrote:
> >> so we've had some odd behavior on a big run recently and
> >> we're having some trouble figuring out exactly what's going on.
> >> it's also worth mentioning that we've had other successful
> >> runs with these settings on these same sites.
> >>
> >> first, tried running on ncsa:
> >>
> >> <!-- NCSAMERCURY @ grid-hg.ncsa.teragrid.org -->
> >> <pool handle="NCSAMERCURY">
> >> <profile namespace="karajan" key="initialScore">1</profile>
> >> <profile namespace="karajan" key="jobThrottle">2</profile>
> >> <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
> >> <execution provider="gt4" url="grid-hg.ncsa.teragrid.org"
> >> jobManager="PBS"/>
> >> <workdirectory>/usr/projects/tg-community/SIDGrid/sidgrid_out/{username}</workdirectory>
> >> </pool>
> >>
> >> and then, after failing/killing it, the run was resumed on ucanl64:
> >>
> >> <!-- ANLUCTERAGRID64 @ tg-grid.uc.teragrid.org -->
> >> <pool handle="ANLUCTERAGRID64" sysinfo="INTEL32::LINUX">
> >> <profile namespace="karajan" key="initialScore">1</profile>
> >> <profile namespace="karajan" key="jobThrottle">2</profile>
> >> <profile namespace="globus"
> >> key="host_types">ia64-compute</profile>
> >> <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org"
> >> storage="/home/skenny/data" major="2" minor="4" patch="3"/>
> >> <execution provider="gt4" jobmanager="PBS"
> >> url="tg-grid.uc.teragrid.org" />
> >> <workdirectory>/scratch/gpfs/local/sidgrid_out/{username}</workdirectory>
> >> </pool>
> >>
> >> the workflow appears ok at first. however we would then get
> >> some failures; the retries of the failed jobs that swift
> >> submits appeared to work but the failures were keeping the run
> >> from ramping up. eventually andric killed the run because
> >> there were so many errors and so few jobs running at once
> >> (though no clear indication of why).
> >>
> >> also, on ucanl, even when we kill the workflow the jobs not
> >> only remain in the queue but i can't kill them at all even
> >> when i own them (ti's looking into this i believe).
> >>
> >> the log file is pretty long so rather than attach i've put
> >> everything from the run here on the ci network:
> >> /home/skenny/andric/permFriedman_run2
> >>
> >> the individual jobs are given a 300min wallclock limit and
> >> generally take about an hour. when jobs fail and/or
> >> exceed the wallclock limit on ucanl i get an email from the pbs
> >> scheduler. in this case i get the following:
> >>
> >> PBS Job Id: 1759715.tg-master.uc.teragrid.org
> >> Job Name: STDIN
> >> Exec host: tg-c054/0
> >> Aborted by PBS Server
> >> Job cannot be executed
> >> See Administrator for help
> >>
> >> finally, our big ugly tc.data file can be seen here if that's
> >> of use:
> >>
> >> https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SIDGrid/config/tc.data
> >>
> >> sorry this email is so lengthy! just wanted to give you guys a
> >> full picture of what we're seeing. i'm open to any ideas, no
> >> matter how outlandish or hacky :) to try and get these running
> >> properly.
> >>
> >> thanks!!
> >> sarah
> >>
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
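For the UCANL jobs that stay in the queue after the workflow is killed, it may be worth confirming from tg-login what PBS itself reports before escalating to the admins. A sketch using standard PBS commands (the job id below is the one from the scheduler mail; a forced purge of a job whose execution host is unresponsive generally needs operator privileges, which would explain why a plain qdel does nothing):

```shell
qstat -u $USER    # list your jobs and their states on tg-master
qdel 1759715      # normal delete; silently ineffective if the node's MOM is down
```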