[Swift-devel] Imbalanced scheduling with coasters and multiple sites

Michael Wilde wilde at mcs.anl.gov
Tue Apr 7 00:15:23 CDT 2009


A note on the sites file below: I used 2hr30min as the walltime to match
Glen's setting for the runs in which he first saw the "imbalance".

In my first tests, I had used 5 min for coasterWorkerMaxwalltime and 
specified no site or tc maxwalltime. I thought that would work, based on 
our earlier lengthy exchanges on this topic. But apparently coasters was 
calculating some default max walltime for "cat", and it gave me an error 
about insufficient time. I was trying to gather that, along with several 
other anomalies, into another report.


On 4/7/09 12:09 AM, Michael Wilde wrote:
> com$ cat abe+qb.xml
> <config>
> 
> <pool handle="abe" >
> 
>   <profile namespace="globus" key="project">TG-CDA070002T</profile>
>   <profile namespace="globus" key="coastersPerNode">8</profile>
>   <profile namespace="globus" key="coasterWorkerMaxwalltime">02:30:00</profile>
> 
>   <execution provider="coaster" url="grid-abe.ncsa.teragrid.org" jobManager="gt2:gt2:pbs" />
>   <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
>   <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
> 
> </pool>
> 
> <pool handle="qb" >
> 
>   <profile namespace="globus" key="project">TG-CDA070002T</profile>
>   <profile namespace="globus" key="coastersPerNode">8</profile>
>   <profile namespace="globus" key="coasterWorkerMaxwalltime">02:30:00</profile>
> 
>   <execution provider="coaster" url="queenbee.loni-lsu.teragrid.org" jobManager="gt2:gt2:pbs" />
>   <gridftp url="gsiftp://qb1.loni.org"/>
>   <workdirectory>/home/ux454325/swiftwork</workdirectory>
> 
> </pool>
> 
> </config>
> com$
> 
> 
> On 4/7/09 12:09 AM, Mihael Hategan wrote:
>> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
>>> The latest rev shows a similar failure on the surface, but I think 
>>> different patterns in the coaster logs.
>>>
>>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped 
>>> outfile.
>>>
>>> This time 39 of 40 jobs ran on abe, and then the workflow lingered 
>>> and finally failed, with 39 ok, 1 failure.
>>>
>>> All the logs for this run are in
>>>    /home/wilde/swift/lab/20090406-2330-72p9ale0
>>>
>>> Below that are dirs for the abe and qb coaster and gram logs.
>>> Abe had no gram log for this run.
>>>
>>> I suspect this one is worth looking at.
>>
>> Indeed. Can you paste your sites file?
>>
>> There's some oddity there.
>>
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


