[Swift-devel] Imbalanced scheduling with coasters and multiple sites
Michael Wilde
wilde at mcs.anl.gov
Tue Apr 7 00:15:23 CDT 2009
Note on below: I used 2hr30min as the time to match Glen's time, for the
runs in which he first saw the "imbalance".
In my first tests, I had used 5 min for coasterWorkerMaxwalltime and
specified no site or tc maxwalltime. I thought that would work, based on
our earlier lengthy exchanges on this topic. But apparently coasters was
calculating some default max walltime for "cat", and it gave me an error
about insufficient time. I was trying to gather that along with several
other anomalies in another report.
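A sketch of what the earlier 5-minute setup described above might look like
for a single pool. This is not the file that was actually run: it reuses the
abe entry from the sites file quoted below, and the 00:05:00 value is an
assumption that simply follows the HH:MM:SS form of the 02:30:00 setting.
No site-level maxwalltime profile is present, which matches the case in
which a default max walltime for "cat" was apparently being calculated.

```xml
<config>

  <pool handle="abe">

    <profile namespace="globus" key="project">TG-CDA070002T</profile>
    <profile namespace="globus" key="coastersPerNode">8</profile>
    <!-- Assumed value: 5 minutes, written in the same form as 02:30:00 -->
    <profile namespace="globus" key="coasterWorkerMaxwalltime">00:05:00</profile>
    <!-- Deliberately no site-level maxwalltime profile in this variant -->

    <execution provider="coaster" url="grid-abe.ncsa.teragrid.org"
               jobManager="gt2:gt2:pbs" />
    <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
    <workdirectory>/u/ac/wilde/swiftwork</workdirectory>

  </pool>

</config>
```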
On 4/7/09 12:09 AM, Michael Wilde wrote:
> com$ cat abe+qb.xml
> <config>
>
> <pool handle="abe" >
>
> <profile namespace="globus" key="project">TG-CDA070002T</profile>
> <profile namespace="globus" key="coastersPerNode">8</profile>
> <profile namespace="globus"
> key="coasterWorkerMaxwalltime">02:30:00</profile>
>
> <execution provider="coaster" url="grid-abe.ncsa.teragrid.org"
> jobManager="gt2:gt2:pbs" />
> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
> <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
>
> </pool>
>
> <pool handle="qb" >
>
> <profile namespace="globus" key="project">TG-CDA070002T</profile>
> <profile namespace="globus" key="coastersPerNode">8</profile>
> <profile namespace="globus"
> key="coasterWorkerMaxwalltime">02:30:00</profile>
>
> <execution provider="coaster" url="queenbee.loni-lsu.teragrid.org"
> jobManager="gt2:gt2:pbs" />
> <gridftp url="gsiftp://qb1.loni.org"/>
> <workdirectory>/home/ux454325/swiftwork</workdirectory>
>
> </pool>
>
> </config>
> com$
>
>
> On 4/7/09 12:09 AM, Mihael Hategan wrote:
>> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
>>> The latest rev shows a similar failure on the surface, but I think
>>> different patterns in the coaster logs.
>>>
>>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped
>>> outfile.
>>>
>>> This time 39 of 40 jobs ran on abe, and then the workflow lingered
>>> and finally failed, with 39 ok, 1 failure.
>>>
>>> All the logs for this run are in
>>> /home/wilde/swift/lab/20090406-2330-72p9ale0
>>>
>>> Below that are dirs for the abe and qb coaster and gram logs.
>>> Abe had no gram log for this run.
>>>
>>> I suspect this one is worth looking at.
>>
>> Indeed. Can you paste your sites file?
>>
>> There's some oddity there.
>>
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel