[Swift-devel] Imbalanced scheduling with coasters and multiple sites

Michael Wilde wilde at mcs.anl.gov
Tue Apr 7 00:33:54 CDT 2009



On 4/7/09 12:26 AM, Mihael Hategan wrote:
> On Tue, 2009-04-07 at 00:15 -0500, Michael Wilde wrote:
>> Note on below: I used 2hr30min as the time to match Glen's time, for the 
>> runs in which he first saw the "imbalance".
>>
>> In my first tests, I had used 5 min for coasterWorkerMaxwalltime and 
>> specified no site or tc maxwalltime. I thought that would work, based on 
>> our earlier lengthy exchanges on this topic. But apparently coasters was 
>> calculating some default max walltime for "cat" and it gave me an error 
>> about insufficient time.
> 
> Right. Previously it would just loop starting workers and then not using
> them because they didn't have enough time. The default walltime is 10
> minutes.

That makes sense then. The error I got was:

2009-04-06 20:52:35,397-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION 
jobid=cat-e3agg19j - Application exception: Job cannot be run with the 
given max walltime worker constraint
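
For reference, a minimal Python sketch of the kind of check that would 
produce this error, assuming the job's walltime is simply compared against 
the worker's max walltime. The function names are hypothetical and this is 
not the actual coaster code; only the 10-minute default comes from Mihael's 
note above.

```python
def parse_walltime(text):
    """Convert an 'HH:MM:SS' walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Default job walltime of 10 minutes, as stated in the thread.
DEFAULT_JOB_WALLTIME = parse_walltime("00:10:00")


def job_fits(worker_maxwalltime, job_walltime=DEFAULT_JOB_WALLTIME):
    """Return True if a job of job_walltime seconds fits in a worker.

    Assumption: a job fits when its walltime does not exceed the
    worker's max walltime. The real scheduler may apply other rules.
    """
    return job_walltime <= parse_walltime(worker_maxwalltime)


if __name__ == "__main__":
    # 5 min worker, default 10 min job: does not fit.
    print(job_fits("00:05:00"))
    # 2h30m worker, default 10 min job: fits.
    print(job_fits("02:30:00"))
```

Under that assumption, a 5 min coasterWorkerMaxwalltime with no site or tc 
maxwalltime cannot hold a job with the 10 min default, while 02:30:00 can.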

The other few anomalies I saw I will ignore unless they happen again, as 
I was using the bad 3/31 revision. These were things like a new service 
starting with some strange default max time ("01:41:00", or 101 minutes) 
after the initial services were started with the correct time, and some 
strange error retry behavior.

Bear with me - these things are very difficult and tedious to report.

>>  I was trying to gather that along with several 
>> other anomalies in another report.
> 
> Now, the oddity below is that both coaster services are started with the
> same service id. Not only that, the same service id was used for
> subsequent runs (the bootstrap logs contain multiple "runs"). This,
> roughly, makes no sense, but I can't imagine it being cause for
> goodness.

OK. Any chance I messed up copying log files (and duplicated one) or are 
you seeing the duplicate service id in truly distinct logs?

(No need for reply - I'm assuming if there was a chance I duplicated a 
log it would be obvious...)

> 
>>
>> On 4/7/09 12:09 AM, Michael Wilde wrote:
>>> com$ cat abe+qb.xml
>>> <config>
>>>
>>> <pool handle="abe" >
>>>
>>>   <profile namespace="globus" key="project">TG-CDA070002T</profile>
>>>   <profile namespace="globus" key="coastersPerNode">8</profile>
>>>   <profile namespace="globus" 
>>> key="coasterWorkerMaxwalltime">02:30:00</profile>
>>>
>>>   <execution provider="coaster" url="grid-abe.ncsa.teragrid.org" 
>>> jobManager="gt2:gt2:pbs" />
>>>   <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
>>>   <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
>>>
>>> </pool>
>>>
>>> <pool handle="qb" >
>>>
>>>   <profile namespace="globus" key="project">TG-CDA070002T</profile>
>>>   <profile namespace="globus" key="coastersPerNode">8</profile>
>>>   <profile namespace="globus" 
>>> key="coasterWorkerMaxwalltime">02:30:00</profile>
>>>
>>>   <execution provider="coaster" url="queenbee.loni-lsu.teragrid.org" 
>>> jobManager="gt2:gt2:pbs" />
>>>   <gridftp url="gsiftp://qb1.loni.org"/>
>>>   <workdirectory>/home/ux454325/swiftwork</workdirectory>
>>>
>>> </pool>
>>>
>>> </config>
>>> com$
>>>
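
A quick way to confirm that both pools carry the same coaster settings is 
to pull the globus profiles out of the sites file. This is only a sketch 
using Python's standard xml.etree module; the coaster_profiles helper is 
hypothetical, and the embedded XML is a trimmed copy of the file above 
(execution, gridftp and workdirectory elements left out).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of abe+qb.xml: only the globus profiles are kept.
SITES_XML = """<config>
  <pool handle="abe">
    <profile namespace="globus" key="project">TG-CDA070002T</profile>
    <profile namespace="globus" key="coastersPerNode">8</profile>
    <profile namespace="globus" key="coasterWorkerMaxwalltime">02:30:00</profile>
  </pool>
  <pool handle="qb">
    <profile namespace="globus" key="project">TG-CDA070002T</profile>
    <profile namespace="globus" key="coastersPerNode">8</profile>
    <profile namespace="globus" key="coasterWorkerMaxwalltime">02:30:00</profile>
  </pool>
</config>"""


def coaster_profiles(xml_text):
    """Map each pool handle to its {key: value} globus profiles."""
    root = ET.fromstring(xml_text)
    result = {}
    for pool in root.findall("pool"):
        result[pool.get("handle")] = {
            profile.get("key"): (profile.text or "").strip()
            for profile in pool.findall("profile")
            if profile.get("namespace") == "globus"
        }
    return result


if __name__ == "__main__":
    profiles = coaster_profiles(SITES_XML)
    for handle, settings in profiles.items():
        print(handle, settings)
    # Identical settings on both sites means the sites file itself
    # gives no obvious reason to prefer one site over the other.
    print(profiles["abe"] == profiles["qb"])
```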
>>>
>>> On 4/7/09 12:09 AM, Mihael Hategan wrote:
>>>> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
>>>>> The latest rev shows a similar failure on the surface, but I think 
>>>>> different patterns in the coaster logs.
>>>>>
>>>>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped 
>>>>> outfile.
>>>>>
>>>>> This time 39 of 40 jobs ran on abe, and then the workflow lingered 
>>>>> and finally failed, with 39 ok, 1 failure.
>>>>>
>>>>> All the logs for this run are in
>>>>>    /home/wilde/swift/lab/20090406-2330-72p9ale0
>>>>>
>>>>> Below that are dirs for the abe and qb coaster and gram logs.
>>>>> Abe had no gram log for this run.
>>>>>
>>>>> I suspect this one is worth looking at.
>>>> Indeed. Can you paste your sites file?
>>>>
>>>> There's some oddity there.
>>>>
>>>>
> 


