[Swift-user] job waiting
Mihael Hategan
hategan at mcs.anl.gov
Wed Apr 22 11:44:14 CDT 2009
This behavior was observed previously with the version you have. I
strongly recommend upgrading to the version Ben mentions.
On Wed, 2009-04-22 at 11:25 -0500, Yue, Chen - BMD wrote:
> Hi Ben,
>
> Yesterday I tested my application a few times on NCSA mercury only,
> with coaster and with globus::maxwalltime=50 specified in tc.data. As
> in my previous attempt, the application kept waiting in several runs,
> after 4076, 4052, 4099, 4048, and 4051 successful returns
> respectively. Is this related to my settings? The log for the last run
> is at:
>
> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1036-07c88p47.log
>
> After about 10 minutes of execution, I started to receive emails with
> the following content:
>
> /////////
> PBS Job Id: 1947957.tg-master.ncsa.teragrid.org
> Job Name: null
> job deleted
> Job deleted at request of root at tg-master.ncsa.teragrid.org
> MOAB_INFO: job exceeded wallclock limit
> /////////
>
> However, Swift did not indicate any job failure, so should I worry
> about the success of those jobs?
>
> I also tried NCSA mercury only without coaster, but the submitted jobs
> do not seem to return successfully. I notice that with coaster the
> maximum number of jobs I typically have on NCSA is about 130, but
> without coaster I can have more than 300 jobs queued on the NCSA
> computer. Is this related to the throttle setting?
>
> I also tried the SDSC dtf server without coaster, but the submitted
> jobs do not get started there. Instead, I got many error messages like
> the following. Should I contact teragrid about these errors?
>
> Progress: Stage in:93 Submitted:3710 Active:45 Stage out:4 Finished
> successfully:230 Failed but can retry:45
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/f on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
>
> The following is my sites.xml content for NCSA mercury with and
> without coaster and SDSC DTF:
>
> <pool handle="NCSAMERCURY">
> <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
> <execution provider="coaster" url="grid-hg.ncsa.teragrid.org"
> jobManager="gt2:PBS"/>
> <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
> </pool>
> <pool handle="NCSAMERCURY_nocoaster">
> <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
> <jobmanager universe="vanilla"
> url="grid-hg.ncsa.teragrid.org:2119/jobmanager-pbs" major="2" />
> <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
> </pool>
> <pool handle="SDSC_dtf_prews_pbs">
> <gridftp url="gsiftp://tg-gridftp.sdsc.teragrid.org:2811/" />
> <jobmanager universe="vanilla"
> url="tg-login1.sdsc.teragrid.org:2119/jobmanager-pbs" major="2" />
> <workdirectory >/gpfs-wan/scratch/yuechen</workdirectory>
> <profile namespace="globus" key="queue">fast</profile>
> </pool>
>
> The swift script I used is at:
>
> /home/yuechen/PTMap2/PTMap2-unmod.swift
>
> The tc.data I used is:
>
> /home/yuechen/PTMap2/tc.data
>
> I will start to try other servers to see if I can run all jobs
> successfully.
>
> Thank you very much for your help!
>
> Chen, Yue
>
> ______________________________________________________________________
> From: Ben Clifford [mailto:benc at hawaga.org.uk]
> Sent: Sun 4/19/2009 2:07 AM
> To: Yue, Chen - BMD
> Cc: swift user
> Subject: RE: [Swift-user] job waiting
>
> On Sat, 18 Apr 2009, Yue, Chen - BMD wrote:
>
> > Thanks for answering my question. This phenomenon occurs after half an
> > hour of execution. If all the jobs finished at the original speed, it
> > would probably take no more than 40 min. How does the system figure
> > out that some jobs will take more than 1 hour? Should I request more
> > time when I execute "grid-proxy-init"?
>
> Not with grid-proxy-init. You can specify a parameter called
> maxwalltime in your sites file or your tc.data file that will tell
> Swift an upper bound on how long your job will run. In Swift 0.8,
> coasters assume something like 10 minutes if you do not specify a
> walltime, so you will run into trouble.
>
> For example, change the null at the end of your tc.data lines to
> globus::maxwalltime=50 to mean 50 minutes maxwalltime.
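>
> For the sites-file route, a minimal sketch, assuming the profile
> element takes maxwalltime the same way the SDSC entry in this thread
> sets queue. The handle, URLs, and work directory are copied from the
> NCSAMERCURY entry in this thread; only the profile line is added, and
> 50 is simply the value from the tc.data example:
>
> ```xml
> <pool handle="NCSAMERCURY">
>   <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>   <execution provider="coaster" url="grid-hg.ncsa.teragrid.org"
>              jobManager="gt2:PBS"/>
>   <!-- Assumed addition: upper bound on job run time, in minutes -->
>   <profile namespace="globus" key="maxwalltime">50</profile>
>   <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
> </pool>
> ```
>
> Setting it in the sites file would apply the bound to every
> application run on that site, whereas the tc.data form applies it per
> application entry.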
>
> There has been work done on coasters since Swift 0.8, and so Mihael
> may have some other recommendations.
>
> > I did not change the default throttles. What values would be more
> > appropriate? The total number of jobs in my application typically
> > runs between 4000 and 30000, and each job can typically be finished
> > within a couple of minutes.
>
> Where is your Swift installation? I would like to look at it.
>
> --
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user