[Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!

Thu Jul 24 17:21:47 CDT 2008

hmm, i think that's the only log we have from this most recent
run. however, we saw the same behavior on another run to ncsa
the week before. the log is here:

/home/skenny/andric/permFriedman_logs/permFriedman1001-20080702-2300-3r702ylc.log

---- Original message ----
>Date: Thu, 24 Jul 2008 17:17:14 -0500
>From: Mihael Hategan <hategan at mcs.anl.gov>  
>Subject: Re: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
>To: skenny at uchicago.edu
>Cc: swift-devel at ci.uchicago.edu, andric <mjandric at gmail.com>
>
>Strange. It looks like the wrapper script never gets to execute on
>UCANL.
>
>Do you have the logs from the first run?
>
>Is 
>
>On Thu, 2008-07-24 at 16:50 -0500, skenny at uchicago.edu wrote:
>> so we've had some odd behavior on a big run recently and are
>> having some trouble figuring out exactly what's going on here.
>> it's also worth mentioning that we've had other successful
>> runs with these settings on these same sites.  
>> 
>> first, tried running on ncsa:
>> 
>> <!-- NCSAMERCURY @ grid-hg.ncsa.teragrid.org -->
>>   <pool handle="NCSAMERCURY">
>>     <profile namespace="karajan" key="initialScore">1</profile>
>>     <profile namespace="karajan" key="jobThrottle">2</profile>
>>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>>     <execution provider="gt4" url="grid-hg.ncsa.teragrid.org" jobManager="PBS"/>
>>     <workdirectory>/usr/projects/tg-community/SIDGrid/sidgrid_out/{username}</workdirectory>
>>   </pool>
>> 
>> and then, after it failed and was killed, the run was resumed on ucanl64:
>> 
>> <!-- ANLUCTERAGRID64 @ tg-grid.uc.teragrid.org -->
>>   <pool handle="ANLUCTERAGRID64" sysinfo="INTEL32::LINUX">
>>     <profile namespace="karajan" key="initialScore">1</profile>
>>     <profile namespace="karajan" key="jobThrottle">2</profile>
>>     <profile namespace="globus" key="host_types">ia64-compute</profile>
>>     <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" storage="/home/skenny/data" major="2" minor="4" patch="3"/>
>>     <execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org" />
>>     <workdirectory>/scratch/gpfs/local/sidgrid_out/{username}</workdirectory>
>>   </pool>
>> 
>> the workflow appeared ok at first. however, we would then get
>> some failures; the retries that swift submits for the failed
>> jobs appeared to work, but the failures were keeping the run
>> from ramping up. eventually andric killed the run because
>> there were so many errors and so few jobs running at once
>> (though with no clear indication of why).
>> 
>> also, on ucanl, even when we kill the workflow the jobs not
>> only remain in the queue but i can't kill them at all even
>> when i own them (ti's looking into this i believe).
>> 
>> the log file is pretty long so rather than attach i've put
>> everything from the run here on the ci network:
>> /home/skenny/andric/permFriedman_run2
>> 
>> the individual jobs are given a 300min wallclock limit and
>> generally take about an hour. finally, when jobs fail and/or
>> exceed wallclock on ucanl i get an email from the pbs
>> scheduler. in this case i get the following:
>> 
>> PBS Job Id: 1759715.tg-master.uc.teragrid.org
>> Job Name:   STDIN
>> Exec host:  tg-c054/0
>> Aborted by PBS Server 
>> Job cannot be executed
>> See Administrator for help
>> 
>> finally, our big ugly tc.data file can be seen here if that's
>> of use:
>> 
>> https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SIDGrid/config/tc.data
>> 
>> sorry this email is so lengthy! just wanted to give you guys a
>> full picture of what we're seeing. i'm open to any ideas, no
>> matter how outlandish or hacky :) to try and get these running
>> properly.
>> 
>> thanks!!
>> sarah
>> 
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
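
For comparing the two pool entries quoted in this thread side by side, here is a minimal Python sketch. It is only an illustration: the `<config>` wrapper element and the function names `summarize_pools` and `attribute_spelling_differences` are made up for this example and are not part of Swift. It lists each pool's execution attributes and flags attribute names that are spelled with different capitalization across pools (the entries above use `jobManager` in one pool and `jobmanager` in the other). Whether that difference has anything to do with the failures described is not established here; it is just one thing that is easy to check mechanically.

```python
import xml.etree.ElementTree as ET

# The two pool entries from the quoted message, wrapped in a hypothetical
# <config> root so the fragment parses as a single XML document.
SITES_XML = """
<config>
  <pool handle="NCSAMERCURY">
    <profile namespace="karajan" key="initialScore">1</profile>
    <profile namespace="karajan" key="jobThrottle">2</profile>
    <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
    <execution provider="gt4" url="grid-hg.ncsa.teragrid.org" jobManager="PBS"/>
    <workdirectory>/usr/projects/tg-community/SIDGrid/sidgrid_out/{username}</workdirectory>
  </pool>
  <pool handle="ANLUCTERAGRID64" sysinfo="INTEL32::LINUX">
    <profile namespace="karajan" key="initialScore">1</profile>
    <profile namespace="karajan" key="jobThrottle">2</profile>
    <profile namespace="globus" key="host_types">ia64-compute</profile>
    <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" storage="/home/skenny/data" major="2" minor="4" patch="3"/>
    <execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org" />
    <workdirectory>/scratch/gpfs/local/sidgrid_out/{username}</workdirectory>
  </pool>
</config>
"""


def summarize_pools(xml_text):
    """Return a dict keyed by pool handle with the settings of each pool."""
    root = ET.fromstring(xml_text)
    summary = {}
    for pool in root.iter("pool"):
        execution = pool.find("execution")
        summary[pool.get("handle")] = {
            "execution_attrs": dict(execution.attrib) if execution is not None else {},
            "profiles": {
                (p.get("namespace"), p.get("key")): p.text
                for p in pool.findall("profile")
            },
            "workdirectory": pool.findtext("workdirectory"),
        }
    return summary


def attribute_spelling_differences(summary):
    """Find execution attribute names that differ only by capitalization."""
    seen = {}
    for info in summary.values():
        for name in info["execution_attrs"]:
            seen.setdefault(name.lower(), set()).add(name)
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    pools = summarize_pools(SITES_XML)
    for handle, info in pools.items():
        print(handle, info["execution_attrs"], info["workdirectory"])
    print("spelling differences:", attribute_spelling_differences(pools))
```

The same two functions can be pointed at a full sites file by reading it from disk and passing the text to `summarize_pools`, provided the file has a single root element.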