[Swift-user] job waiting

Wed Apr 22 11:25:42 CDT 2009

Hi Ben,

Yesterday I tested my application a few times on NCSA Mercury only, using coasters and with globus::maxwalltime=50 specified in tc.data. As in my previous attempts, the application kept waiting in several runs, after 4076, 4052, 4099, 4048, and 4051 successful returns respectively. Is this related to my settings? The log for the last run is at:

/home/yuechen/PTMap2/PTMap2-unmod-20090422-1036-07c88p47.log

About 10 minutes into execution, I started to receive emails with the following content:

/////////
PBS Job Id: 1947957.tg-master.ncsa.teragrid.org
Job Name:   null
job deleted
Job deleted at request of root at tg-master.ncsa.teragrid.org
MOAB_INFO:  job exceeded wallclock limit
/////////

However, Swift did not report any job failures. Should I be worried about whether those jobs succeeded?

I also tried NCSA Mercury only, without coasters, but the submitted jobs do not seem to return successfully. I notice that with coasters, the maximum number of jobs I have on NCSA is typically about 130, whereas without coasters I can have more than 300 jobs queued on the NCSA machine. Is this related to the throttle setting?

I also tried the SDSC DTF server without coasters, but the submitted jobs never started there. Instead, I got many error messages like the following. Should I contact TeraGrid about these errors?

Progress:  Stage in:93 Submitted:3710 Active:45 Stage out:4 Finished successfully:230 Failed but can retry:45
Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/f on SDSC_dtf_prews_pbs
Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs
Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs
Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs

The following is my sites.xml content for NCSA mercury with and without coaster and SDSC DTF:

 <pool handle="NCSAMERCURY">
    <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
    <execution provider="coaster" url="grid-hg.ncsa.teragrid.org" jobManager="gt2:PBS"/>
    <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
 </pool>
 <pool handle="NCSAMERCURY_nocoaster">
    <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
    <jobmanager universe="vanilla" url="grid-hg.ncsa.teragrid.org:2119/jobmanager-pbs" major="2"/>
    <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
 </pool>
 <pool handle="SDSC_dtf_prews_pbs">
    <gridftp url="gsiftp://tg-gridftp.sdsc.teragrid.org:2811/"/>
    <jobmanager universe="vanilla" url="tg-login1.sdsc.teragrid.org:2119/jobmanager-pbs" major="2"/>
    <workdirectory>/gpfs-wan/scratch/yuechen</workdirectory>
    <profile namespace="globus" key="queue">fast</profile>
 </pool>

The swift script I used is at:

/home/yuechen/PTMap2/PTMap2-unmod.swift

The tc.data I used is:

/home/yuechen/PTMap2/tc.data

I will now try other servers to see whether I can run all jobs successfully.

Thank you very much for your help!

Chen, Yue

________________________________

From: Ben Clifford [mailto:benc at hawaga.org.uk]
Sent: Sun 4/19/2009 2:07 AM
To: Yue, Chen - BMD
Cc: swift user
Subject: RE: [Swift-user] job waiting

On Sat, 18 Apr 2009, Yue, Chen - BMD wrote:

> Thanks for answering my question. This phenomenon occurs after half an
> hour of execution. If all the jobs finished at the original speed,
> it would probably take no more than 40 min. How does the system figure
> out that some jobs will take more than 1 hour? Should I request more
> time when I execute "grid-proxy-init"?

Not with grid-proxy-init. You can specify a parameter called maxwalltime
in your sites file or your tc.data file that will tell Swift an upper
bound on how long your job will run. In Swift 0.8, coasters assume
something like 10 minutes if you do not specify a walltime, so you will
run into trouble.

For example, change the null at the end of your tc.data lines to
globus::maxwalltime=50  to mean 50 minutes maxwalltime.
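
The sites-file alternative mentioned above might look like the following sketch. It reuses the coaster pool from the original message and adds one profile line, following the same profile syntax already used for the queue setting on the SDSC pool. It assumes the globus maxwalltime profile is read in minutes, as in the tc.data example, and it has not been verified against Swift 0.8.

```xml
<pool handle="NCSAMERCURY">
    <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
    <execution provider="coaster" url="grid-hg.ncsa.teragrid.org" jobManager="gt2:PBS"/>
    <!-- Assumed upper bound of 50 minutes per job, matching the tc.data example -->
    <profile namespace="globus" key="maxwalltime">50</profile>
    <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
</pool>
```

Setting the value in the sites file applies it to every application run on that pool, whereas the tc.data entry applies it per application line.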

There has been work done on coasters since Swift 0.8, and so Mihael may
have some other recommendations.

> I did not change the default throttles. What values would be more
> appropriate? The total number of jobs in my application typically
> ranges between 4000 and 30000, and each job typically finishes within
> a couple of minutes.

Where is your Swift installation? I would like to look at it.

--
