[Swift-devel] estranged on ranger

Mihael Hategan hategan at mcs.anl.gov
Thu Mar 19 22:40:12 CDT 2009


For the record, the issue seems to be a file system (FS) problem. This is
suggested by the file operation tasks in
http://www.ci.uchicago.edu/~skenny/sem/report-modgenproc-20090319-2002-b0nthqyg/karajan.html
which show that, before it was stopped, swift was running some very slow
file tasks (166 seconds each?).

My suspicion is that running swift on ranger's head node is bound to
cause problems (due to doubled FS load) and/or to suffer from problems
caused by the many other people running things there.

My recommendation is to run swift from a machine at CI, preferably
from a local disk if many files are involved.
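
One way to check the FS suspicion before moving anything is to time a few
small file operations in each candidate directory. The following is only a
minimal sketch, assuming Python 3 is available on both machines; the helper
name time_file_ops and its parameters are hypothetical and are not part of
swift. It measures plain create/write/read/delete cycles, not the actual
swift file tasks.

```python
import os
import tempfile
import time


def time_file_ops(directory, count=20, size=4096):
    """Return mean seconds per create/write/fsync/read/delete cycle.

    Hypothetical helper for illustration; `directory` must already exist
    and be writable by the caller.
    """
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(count):
        # Create a uniquely named scratch file inside the target directory.
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as handle:
                handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())  # push the data to the file system
            with open(path, "rb") as handle:
                handle.read()
        finally:
            os.remove(path)
    return (time.perf_counter() - start) / count
```

Calling time_file_ops on the shared work directory and again on a local
disk gives two numbers to compare; a large gap would be consistent with the
slow file tasks seen in the report, though it would not prove the cause.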

On Thu, 2009-03-19 at 21:53 -0500, skenny at uchicago.edu wrote:
> hey there, i'm having some trouble figuring out why my
> gigantic workflow is failing :) all the details are below...i
> should mention also that i ran 10k jobs with the same configs
> and it completed w/o err in about 28min. 
> 
> so i'm trying to run the 65k workflow with the latest
> build from svn. the workflow completes 244 of the jobs and
> then begins failing. it never returns an error but seems to
> hang for quite some time (though all jobs have left the q). 
> 
> from the properties file:
> 
> lazy.errors=false
> caching.algorithm=LRU
> pgraph=false
> pgraph.graph.options=splines="compound", rankdir="TB"
> pgraph.node.options=color="seagreen", style="filled"
> clustering.enabled=false
> clustering.queue.delay=4
> clustering.min.time=60
> 
> kickstart.enabled=maybe
> kickstart.always.transfer=false
> wrapperlog.always.transfer=false
> 
> throttle.submit=6
> throttle.host.submit=3
> 
> throttle.score.job.factor=8
> throttle.transfers=16
> 
> throttle.file.operations=16
> sitedir.keep=true
> execution.retries=2
> 
> replication.enabled=false
> replication.min.queue.time=60
> replication.limit=3
> foreach.max.threads=1024
> 
> from sites:
>  <!-- RANGER @ tg-login.ranger.tacc.teragrid.org -->
>   <pool handle="RANGER">
>     <profile namespace="karajan" key="initialScore">1</profile>
>     <profile namespace="karajan" key="jobThrottle">8</profile>
>     <profile namespace="globus" key="project">TG-DBS090006</profile>
>     <filesystem provider="coaster" url="gt2://gatekeeper.ranger.tacc.teragrid.org"/>
>     <profile namespace="globus" key="coastersPerNode">16</profile>
>     <execution provider="coaster" url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>     <workdirectory>/scratch/projects/tg/SIDGrid/sidgrid_out/{username}</workdirectory>
>   </pool>
> 
> ran the log plot:
> 
> http://www.ci.uchicago.edu/~skenny/sem/report-modgenproc-20090319-2002-b0nthqyg/index.html
> 
> the log itself is here on ranger:
> /scratch/projects/tg/SIDGrid/swift-logs/skenny/modgenproc-20090319-1513-m5tlihce.log
> 
> thoughts? ideas of what i might try to tweak?
> 
> thanks!
> 
> ~skenny
> 



