[Swift-devel] Imbalanced scheduling with coasters and multiple sites

Mihael Hategan hategan at mcs.anl.gov
Tue Apr 7 00:26:35 CDT 2009


On Tue, 2009-04-07 at 00:15 -0500, Michael Wilde wrote:
> Note on below: I used 2hr30min as the time to match Glen's time, for the 
> runs in which he first saw the "imbalance".
> 
> In my first tests, I had used 5 min for coasterWorkerMaxwalltime and 
> specified no site or tc maxwalltime. I thought that would work, based on 
> our earlier lengthy exchanges on this topic. But apparently coasters was 
> calculating some default max walltime for "cat" and it gave me an error 
> about insufficient time.

Right. Previously it would just loop starting workers and then not using
them because they didn't have enough time. The default walltime is 10
minutes.
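
To make the walltime check concrete, here is a minimal sketch. It is not the
actual coaster code. It assumes walltimes are written as HH:MM:SS, that a job
with no explicit maxwalltime gets the 10-minute default mentioned above, and
that a worker is usable only if its remaining walltime covers the job's. The
names parse_walltime and worker_can_run are hypothetical.

```python
# Default job walltime of 10 minutes, as stated in the message above.
DEFAULT_JOB_WALLTIME_S = 10 * 60


def parse_walltime(text):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def worker_can_run(worker_remaining_s, job_walltime_s=None):
    """Return True if the worker has enough time left for the job.

    The comparison rule is an assumption made for illustration.
    """
    if job_walltime_s is None:
        job_walltime_s = DEFAULT_JOB_WALLTIME_S
    return worker_remaining_s >= job_walltime_s


# A 5-minute worker against a job that falls back to the 10-minute default.
print(worker_can_run(parse_walltime("00:05:00")))
# A 2h30m worker against the same job.
print(worker_can_run(parse_walltime("02:30:00")))
```

Under these assumptions, a 5-minute coasterWorkerMaxwalltime with no site or
tc maxwalltime can never satisfy a job that defaults to 10 minutes, which is
consistent with the "insufficient time" error.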

>  I was trying to gather that along with several 
> other anomalies in another report.

Now, the oddity below is that both coaster services are started with the
same service id. Not only that, the same service id was used for
subsequent runs (the bootstrap logs contain multiple "runs"). This,
roughly, makes no sense, but I can't imagine it being cause for
goodness.
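
One way to check for this kind of collision is sketched below. It assumes the
(site, service id) pairs have already been pulled out of the logs by hand; no
log format is assumed, the helper name find_shared_ids is hypothetical, and
the ids in the example are made-up placeholders, not values from this run.

```python
from collections import defaultdict


def find_shared_ids(pairs):
    """Return the service ids that appear under more than one site.

    `pairs` is an iterable of (site, service_id) tuples. The result maps
    each shared id to the sorted list of sites that reported it.
    """
    sites_by_id = defaultdict(set)
    for site, service_id in pairs:
        sites_by_id[service_id].add(site)
    return {
        service_id: sorted(sites)
        for service_id, sites in sites_by_id.items()
        if len(sites) > 1
    }


# Placeholder data for illustration only.
observed = [("abe", "id-A"), ("qb", "id-A")]
print(find_shared_ids(observed))
```

An empty result means every site reported its own id. Any entry in the result
is a service id reported by two or more sites.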

> 
> 
> On 4/7/09 12:09 AM, Michael Wilde wrote:
> > com$ cat abe+qb.xml
> > <config>
> > 
> > <pool handle="abe" >
> > 
> >   <profile namespace="globus" key="project">TG-CDA070002T</profile>
> >   <profile namespace="globus" key="coastersPerNode">8</profile>
> >   <profile namespace="globus" key="coasterWorkerMaxwalltime">02:30:00</profile>
> > 
> >   <execution provider="coaster" url="grid-abe.ncsa.teragrid.org" jobManager="gt2:gt2:pbs" />
> >   <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
> >   <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
> > 
> > </pool>
> > 
> > <pool handle="qb" >
> > 
> >   <profile namespace="globus" key="project">TG-CDA070002T</profile>
> >   <profile namespace="globus" key="coastersPerNode">8</profile>
> >   <profile namespace="globus" key="coasterWorkerMaxwalltime">02:30:00</profile>
> > 
> >   <execution provider="coaster" url="queenbee.loni-lsu.teragrid.org" jobManager="gt2:gt2:pbs" />
> >   <gridftp url="gsiftp://qb1.loni.org"/>
> >   <workdirectory>/home/ux454325/swiftwork</workdirectory>
> > 
> > </pool>
> > 
> > </config>
> > com$
> > 
> > 
> > On 4/7/09 12:09 AM, Mihael Hategan wrote:
> >> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
> >>> The latest rev shows a similar failure on the surface, but I think 
> >>> different patterns in the coaster logs.
> >>>
> >>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped 
> >>> outfile.
> >>>
> >>> This time 39 of 40 jobs ran on abe, and then the workflow lingered 
> >>> and finally failed, with 39 ok, 1 failure.
> >>>
> >>> All the logs for this run are in
> >>>    /home/wilde/swift/lab/20090406-2330-72p9ale0
> >>>
> > >>> Below that are dirs for the abe and qb coaster and gram logs.
> > >>> Abe had no gram log for this run.
> >>>
> >>> I suspect this one is worth looking at.
> >>
> >> Indeed. Can you paste your sites file?
> >>
> >> There's some oddity there.
> >>
> >>
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel



