[Swift-devel] Imbalanced scheduling with coasters and multiple sites

Michael Wilde wilde at mcs.anl.gov
Tue Apr 7 00:09:58 CDT 2009


com$ cat abe+qb.xml
<config>

<pool handle="abe" >

   <profile namespace="globus" key="project">TG-CDA070002T</profile>
   <profile namespace="globus" key="coastersPerNode">8</profile>
   <profile namespace="globus" key="coasterWorkerMaxwalltime">02:30:00</profile>

   <execution provider="coaster" url="grid-abe.ncsa.teragrid.org" jobManager="gt2:gt2:pbs" />
   <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
   <workdirectory>/u/ac/wilde/swiftwork</workdirectory>

</pool>

<pool handle="qb" >

   <profile namespace="globus" key="project">TG-CDA070002T</profile>
   <profile namespace="globus" key="coastersPerNode">8</profile>
   <profile namespace="globus" key="coasterWorkerMaxwalltime">02:30:00</profile>

   <execution provider="coaster" url="queenbee.loni-lsu.teragrid.org" jobManager="gt2:gt2:pbs" />
   <gridftp url="gsiftp://qb1.loni.org"/>
   <workdirectory>/home/ux454325/swiftwork</workdirectory>

</pool>

</config>
com$
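
In case it helps when comparing the two pool entries: below is a minimal Python sketch that parses the sites file above and lists the settings that differ between abe and qb. It is not part of Swift. The helper names (summarize_pools, differing_keys) are hypothetical, and it assumes only the pool/profile/execution/gridftp/workdirectory layout shown in this file.

```python
import xml.etree.ElementTree as ET

# Copy of the sites file from this message, with wrapped lines rejoined.
SITES_XML = """<config>
<pool handle="abe">
   <profile namespace="globus" key="project">TG-CDA070002T</profile>
   <profile namespace="globus" key="coastersPerNode">8</profile>
   <profile namespace="globus" key="coasterWorkerMaxwalltime">02:30:00</profile>
   <execution provider="coaster" url="grid-abe.ncsa.teragrid.org" jobManager="gt2:gt2:pbs" />
   <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
   <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
</pool>
<pool handle="qb">
   <profile namespace="globus" key="project">TG-CDA070002T</profile>
   <profile namespace="globus" key="coastersPerNode">8</profile>
   <profile namespace="globus" key="coasterWorkerMaxwalltime">02:30:00</profile>
   <execution provider="coaster" url="queenbee.loni-lsu.teragrid.org" jobManager="gt2:gt2:pbs" />
   <gridftp url="gsiftp://qb1.loni.org"/>
   <workdirectory>/home/ux454325/swiftwork</workdirectory>
</pool>
</config>"""


def summarize_pools(xml_text):
    """Return {pool handle: {setting name: value}} for every <pool> element."""
    root = ET.fromstring(xml_text)
    pools = {}
    for pool in root.findall("pool"):
        settings = {}
        # Profiles become "namespace.key" entries, e.g. "globus.coastersPerNode".
        for profile in pool.findall("profile"):
            name = f'{profile.get("namespace")}.{profile.get("key")}'
            settings[name] = (profile.text or "").strip()
        execution = pool.find("execution")
        if execution is not None:
            for attr in ("provider", "url", "jobManager"):
                settings[f"execution.{attr}"] = execution.get(attr)
        gridftp = pool.find("gridftp")
        if gridftp is not None:
            settings["gridftp.url"] = gridftp.get("url")
        settings["workdirectory"] = (pool.findtext("workdirectory") or "").strip()
        pools[pool.get("handle")] = settings
    return pools


def differing_keys(pools, a, b):
    """Return the sorted setting names whose values differ between pools a and b."""
    keys = set(pools[a]) | set(pools[b])
    return sorted(k for k in keys if pools[a].get(k) != pools[b].get(k))


pools = summarize_pools(SITES_XML)
for key in differing_keys(pools, "abe", "qb"):
    print(f"{key}: abe={pools['abe'].get(key)} qb={pools['qb'].get(key)}")
```

This only compares the configuration text. It says nothing about why the scheduler placed the jobs the way it did.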


On 4/7/09 12:09 AM, Mihael Hategan wrote:
> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
>> The latest rev shows a similar failure on the surface, but I think 
>> different patterns in the coaster logs.
>>
>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped outfile.
>>
>> This time 39 of 40 jobs ran on abe, and then the workflow lingered and 
>> finally failed, with 39 ok, 1 failure.
>>
>> All the logs for this run are in
>>    /home/wilde/swift/lab/20090406-2330-72p9ale0
>>
>> Below that are dirs for the abe and qb coaster and gram logs.
>> Abe had no gram log for this run.
>>
>> I suspect this one is worth looking at.
> 
> Indeed. Can you paste your sites file?
> 
> There's some oddity there.
> 
> 


