[Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!

skenny at uchicago.edu skenny at uchicago.edu
Thu Jul 24 17:32:55 CDT 2008


yes (see below) and SOME of the jobs in the workflow do
complete when we submit the whole workflow to ucanl.
unfortunately i can't test anything on ncsa right now 'cause
it's down. 

[skenny at gwynn mediator]$ globusrun-ws -submit -s -F tg-grid1.uc.teragrid.org -Ft PBS -job-command /bin/hostname
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:3fecbd58-59d0-11dd-8cd1-0019d1912789
Termination time: 07/25/2008 22:31 GMT
Current job state: Pending
Current job state: Active
----------------------------------------
Begin PBS Prologue Thu Jul 24 17:31:26 CDT 2008
Job ID:         1759742.tg-master.uc.teragrid.org
Username:       sidgrid
Group:          allocate
Nodes:          tg-v086
End PBS Prologue Thu Jul 24 17:31:26 CDT 2008
----------------------------------------
tg-v086.uc.teragrid.org
----------------------------------------
Begin PBS Epilogue Thu Jul 24 17:31:29 CDT 2008
Job ID:         1759742.tg-master.uc.teragrid.org
Username:       sidgrid
Group:          allocate
Job Name:       STDIN
Session:        12326
Limits:         nodes=1,walltime=00:15:00
Resources:      cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:01
Nodes:          tg-v086
End PBS Epilogue Thu Jul 24 17:31:29 CDT 2008
----------------------------------------
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
[skenny at gwynn mediator]$
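
in case it helps to script this check for other sites: a minimal python sketch, assuming the globusrun-ws output above has been captured into a string. it pulls out the "Current job state:" lines and reports whether the last one is Done. the function names are made up for illustration and are not part of swift or globus.

```python
import re
from typing import List

# Matches lines such as "Current job state: Active" in captured output.
STATE_RE = re.compile(r"^Current job state:\s*(\S+)\s*$", re.MULTILINE)


def job_states(output: str) -> List[str]:
    """Return the reported job states in the order they were printed."""
    return STATE_RE.findall(output)


def reached_done(output: str) -> bool:
    """True if at least one state was reported and the last one is Done."""
    states = job_states(output)
    return bool(states) and states[-1] == "Done"


if __name__ == "__main__":
    # State lines copied from the transcript above.
    sample = (
        "Submitting job...Done.\n"
        "Current job state: Pending\n"
        "Current job state: Active\n"
        "Current job state: CleanUp-Hold\n"
        "Current job state: CleanUp\n"
        "Current job state: Done\n"
        "Destroying job...Done.\n"
    )
    print(job_states(sample))
    print(reached_done(sample))
```

this only looks at what globusrun-ws printed; it says nothing about why a job that never reaches Done got stuck.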


---- Original message ----
>Date: Thu, 24 Jul 2008 17:30:12 -0500
>From: Mihael Hategan <hategan at mcs.anl.gov>  
>Subject: Re: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
>To: skenny at uchicago.edu
>Cc: swift-devel at ci.uchicago.edu, andric <mjandric at gmail.com>
>
>On Thu, 2008-07-24 at 17:21 -0500, skenny at uchicago.edu wrote:
>> hmm, i think that's the only log we have from this most recent
>> run. however, we saw the same behavior on another run to ncsa
>> the week before. the log is here:
>> 
>> /home/skenny/andric/permFriedman_logs/permFriedman1001-20080702-2300-3r702ylc.log
>
>Not the same. In this case it seems like WS-GRAM requests are getting a
>connection reset. I'm not really sure what could cause that, but it's
>somewhere at the TCP level.
>
>Anyway, can you run manual jobs on UCANL?
>
>> 
>> ---- Original message ----
>> >Date: Thu, 24 Jul 2008 17:17:14 -0500
>> >From: Mihael Hategan <hategan at mcs.anl.gov>  
>> >Subject: Re: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
>> >To: skenny at uchicago.edu
>> >Cc: swift-devel at ci.uchicago.edu, andric <mjandric at gmail.com>
>> >
>> >Strange. It looks like the wrapper script never gets to execute on
>> >UCANL.
>> >
>> >Do you have the logs from the first run?
>> >
>> >Is 
>> >
>> >On Thu, 2008-07-24 at 16:50 -0500, skenny at uchicago.edu wrote:
>> >> so we've had some odd behavior on a big run recently and
>> >> we're having some trouble figuring out exactly what's going on here.
>> >> it's also worth mentioning that we've had other successful
>> >> runs with these settings on these same sites.  
>> >> 
>> >> first, tried running on ncsa:
>> >> 
>> >> <!-- NCSAMERCURY @ grid-hg.ncsa.teragrid.org -->
>> >>   <pool handle="NCSAMERCURY">
>> >>     <profile namespace="karajan" key="initialScore">1</profile>
>> >>     <profile namespace="karajan" key="jobThrottle">2</profile>
>> >>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>> >>     <execution provider="gt4" url="grid-hg.ncsa.teragrid.org" jobManager="PBS"/>
>> >>     <workdirectory>/usr/projects/tg-community/SIDGrid/sidgrid_out/{username}</workdirectory>
>> >>   </pool>
>> >> 
>> >> and then, after failing and being killed, the run was resumed on ucanl64:
>> >> 
>> >> <!-- ANLUCTERAGRID64 @ tg-grid.uc.teragrid.org -->
>> >>   <pool handle="ANLUCTERAGRID64" sysinfo="INTEL32::LINUX">
>> >>     <profile namespace="karajan" key="initialScore">1</profile>
>> >>     <profile namespace="karajan" key="jobThrottle">2</profile>
>> >>     <profile namespace="globus" key="host_types">ia64-compute</profile>
>> >>     <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" storage="/home/skenny/data" major="2" minor="4" patch="3"/>
>> >>     <execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org" />
>> >>     <workdirectory>/scratch/gpfs/local/sidgrid_out/{username}</workdirectory>
>> >>   </pool>
>> >> 
>> >> the workflow appeared ok at first. however, we would then get
>> >> some failures; the retries that swift submits for the failed jobs
>> >> appeared to work, but the failures were keeping the run
>> >> from ramping up. eventually andric killed the run because
>> >> there were so many errors and so few jobs running at once
>> >> (though with no clear indication of why).
>> >> 
>> >> also, on ucanl, even when we kill the workflow the jobs not
>> >> only remain in the queue but i can't kill them at all even
>> >> when i own them (ti's looking into this i believe).
>> >> 
>> >> the log file is pretty long so rather than attach i've put
>> >> everything from the run here on the ci network:
>> >> /home/skenny/andric/permFriedman_run2
>> >> 
>> >> the individual jobs are given a 300min wallclock limit and
>> >> generally take about an hour. finally, when jobs fail and/or
>> >> exceed wallclock on ucanl i get an email from the pbs
>> >> scheduler. in this case i get the following:
>> >> 
>> >> PBS Job Id: 1759715.tg-master.uc.teragrid.org
>> >> Job Name:   STDIN
>> >> Exec host:  tg-c054/0
>> >> Aborted by PBS Server 
>> >> Job cannot be executed
>> >> See Administrator for help
>> >> 
>> >> finally, our big ugly tc.data file can be seen here if that's
>> >> of use:
>> >> 
>> >> https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SIDGrid/config/tc.data
>> >> 
>> >> sorry this email is so lengthy! just wanted to give you guys a
>> >> full picture of what we're seeing. i'm open to any ideas, no
>> >> matter how outlandish or hacky :) to try and get these running
>> >> properly.
>> >> 
>> >> thanks!!
>> >> sarah
>> >> 
>> >
>
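
since the two site entries quoted in this thread are easier to compare side by side than by eye, here is a rough python sketch that loads both and prints the execution attributes, profiles and work directory for each. the `<config>` wrapper element and the `summarize` function are made up for illustration; the pool entries themselves are copied from the quoted message. one visible difference is that the ncsa entry spells the attribute `jobManager` while the ucanl entry spells it `jobmanager`; whether swift treats those the same is not established anywhere in this thread.

```python
import xml.etree.ElementTree as ET

# Pool entries copied from the quoted message, wrapped in a hypothetical
# <config> root element so the fragment parses as a single document.
SITES = """<config>
  <pool handle="NCSAMERCURY">
    <profile namespace="karajan" key="initialScore">1</profile>
    <profile namespace="karajan" key="jobThrottle">2</profile>
    <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
    <execution provider="gt4" url="grid-hg.ncsa.teragrid.org" jobManager="PBS"/>
    <workdirectory>/usr/projects/tg-community/SIDGrid/sidgrid_out/{username}</workdirectory>
  </pool>
  <pool handle="ANLUCTERAGRID64" sysinfo="INTEL32::LINUX">
    <profile namespace="karajan" key="initialScore">1</profile>
    <profile namespace="karajan" key="jobThrottle">2</profile>
    <profile namespace="globus" key="host_types">ia64-compute</profile>
    <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" storage="/home/skenny/data" major="2" minor="4" patch="3"/>
    <execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org" />
    <workdirectory>/scratch/gpfs/local/sidgrid_out/{username}</workdirectory>
  </pool>
</config>"""


def summarize(xml_text):
    """Map each pool handle to its execution attributes, profiles and workdir."""
    summary = {}
    for pool in ET.fromstring(xml_text).iter("pool"):
        execution = pool.find("execution")
        summary[pool.get("handle")] = {
            # Attribute names are kept exactly as written, including case.
            "execution": dict(execution.attrib) if execution is not None else {},
            "profiles": {p.get("key"): p.text for p in pool.findall("profile")},
            "workdirectory": pool.findtext("workdirectory"),
        }
    return summary


if __name__ == "__main__":
    for handle, info in summarize(SITES).items():
        print(handle)
        for field, value in info.items():
            print("  ", field, value)
```

this is only a way to eyeball the settings; it does not validate the entries against whatever schema swift expects.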


