[Swift-user] job waiting

Mihael Hategan hategan at mcs.anl.gov
Wed Apr 22 11:44:14 CDT 2009


This behavior was observed previously with the version you have. I
strongly recommend upgrading to the version Ben mentions.
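
For reference, the maxwalltime setting discussed in the quoted thread below can, per Ben's note, go in the sites file as well as in tc.data. The following is only a minimal sketch: it assumes the profile element takes the same form as the queue profile in the SDSC pool entry quoted below, and the value 50 (minutes) simply mirrors the tc.data setting already in use. Check it against the documentation for your Swift version before relying on it.

```xml
<pool handle="NCSAMERCURY">
    <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
    <execution provider="coaster" url="grid-hg.ncsa.teragrid.org" jobManager="gt2:PBS"/>
    <!-- Assumption: upper bound on per-job run time, in minutes -->
    <profile namespace="globus" key="maxwalltime">50</profile>
    <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
</pool>
```

A sites-file setting like this would apply to every application run on that pool, whereas the tc.data form is set per transformation.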

On Wed, 2009-04-22 at 11:25 -0500, Yue, Chen - BMD wrote:
> Hi Ben,
>  
> Yesterday, I tested my application a few times on NCSA mercury only,
> with coaster and with globus::maxwalltime=50 specified in tc.data. As
> in my previous attempt, in several runs the application kept waiting
> after 4076, 4052, 4099, 4048, and 4051 successful returns,
> respectively. Is this related to my settings? The log for the last run
> is at: 
>  
> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1036-07c88p47.log
> 
> I started to receive email with the following content after about 10
> min of execution,  
>  
> /////////
> PBS Job Id: 1947957.tg-master.ncsa.teragrid.org
> Job Name:   null
> job deleted
> Job deleted at request of root at tg-master.ncsa.teragrid.org
> MOAB_INFO:  job exceeded wallclock limit
> /////////
>  
> However, Swift did not indicate any job failure, so should I worry
> about the success of those jobs? 
>  
> I also tried NCSA mercury only, without coaster, but the submitted jobs
> do not seem to return successfully. I notice that with coaster the
> maximum number of jobs I typically have on NCSA is about 130, but
> without coaster I can have more than 300 jobs queued on the NCSA
> machine. Is this related to the throttle setting?
>  
> I also tried SDSC dtf server without coaster, but the jobs submitted
> do not get started on SDSC dtf server. Instead, I got many error
> messages like the following. Should I contact teragrid for these
> errors?
>  
> Progress:  Stage in:93 Submitted:3710 Active:45 Stage out:4 Finished successfully:230 Failed but can retry:45
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/f on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
>  
> The following is my sites.xml content for NCSA mercury with and
> without coaster and SDSC DTF:
>  
>  <pool handle="NCSAMERCURY">
>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>     <execution provider="coaster" url="grid-hg.ncsa.teragrid.org" jobManager="gt2:PBS"/>
>     <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
>  </pool>
>  <pool handle="NCSAMERCURY_nocoaster">
>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>     <jobmanager universe="vanilla" url="grid-hg.ncsa.teragrid.org:2119/jobmanager-pbs" major="2" />
>     <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
>  </pool>
>  <pool handle="SDSC_dtf_prews_pbs">
>     <gridftp url="gsiftp://tg-gridftp.sdsc.teragrid.org:2811/" />
>     <jobmanager universe="vanilla" url="tg-login1.sdsc.teragrid.org:2119/jobmanager-pbs" major="2" />
>     <workdirectory>/gpfs-wan/scratch/yuechen</workdirectory>
>     <profile namespace="globus" key="queue">fast</profile>
>  </pool>
>  
> The swift script I used is at:
>  
> /home/yuechen/PTMap2/PTMap2-unmod.swift
>  
> The tc.data I used is:
>  
> /home/yuechen/PTMap2/tc.data
>  
> I will start to try other servers to see if I can run all jobs
> successfully.
>  
> Thank you very much for help!
>  
> Chen, Yue
>  
> ______________________________________________________________________
> From: Ben Clifford [mailto:benc at hawaga.org.uk]
> Sent: Sun 4/19/2009 2:07 AM
> To: Yue, Chen - BMD
> Cc: swift user
> Subject: RE: [Swift-user] job waiting
> 
> 
> 
> On Sat, 18 Apr 2009, Yue, Chen - BMD wrote:
> 
> > Thanks for answering my question. This phenomenon occurs after half an
> > hour of execution. If all the jobs finished at the original speed, it
> > would probably take no more than 40 min. How does the system figure
> > out that some jobs will take more than 1 hour? Should I request more
> > time when I execute "grid-proxy-init"?
> 
> Not with grid-proxy-init. You can specify a parameter called maxwalltime
> in your sites file or your tc.data file that will tell Swift an upper
> bound on how long your job will run. In Swift 0.8, coasters assume
> something like 10 minutes if you do not specify a walltime, so you will
> run into trouble.
> 
> For example, change the null at the end of your tc.data lines to
> globus::maxwalltime=50  to mean 50 minutes maxwalltime.
> 
> There has been work done on coasters since Swift 0.8, and so Mihael may
> have some other recommendations.
> 
> > I did not change the default throttles. What values would be more
> > appropriate? The total number of jobs in my application is typically
> > between 4000 and 30000, and each job typically finishes within a
> > couple of minutes.
> 
> Where is your Swift installation? I would like to look at it.
> 



