[Swift-devel] Imbalanced scheduling with coasters and multiple sites
Michael Wilde
wilde at mcs.anl.gov
Tue Apr 7 00:33:54 CDT 2009
On 4/7/09 12:26 AM, Mihael Hategan wrote:
> On Tue, 2009-04-07 at 00:15 -0500, Michael Wilde wrote:
>> Note on below: I used 2hr30min as the time to match Glen's time, for the
>> runs in which he first saw the "imbalance".
>>
>> In my first tests, I had used 5 min for coasterWorkerMaxwalltime and
>> specified no site or tc maxwalltime. I thought that would work, based on
>> our earlier lengthy exchanges on this topic. But apparently coasters was
>> calculating some default max walltime for "cat" and it gave me an error
>> about insufficient time.
>
> Right. Previously it would just loop starting workers and then not using
> them because they didn't have enough time. The default walltime is 10
> minutes.
That makes sense then. The error I got was:
2009-04-06 20:52:35,397-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
jobid=cat-e3agg19j - Application exception: Job cannot be run with the
given max walltime worker constraint
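
To spell out how I read that check, here is a rough sketch in Python. It is
only an illustration: the function names are hypothetical and this is not the
actual coaster code. It assumes a job with no explicit maxwalltime gets the
10-minute default mentioned above, and that the job is rejected when its
walltime exceeds the worker's coasterWorkerMaxwalltime.

```python
# Hypothetical sketch of the walltime constraint; not the real coaster code.

DEFAULT_JOB_WALLTIME = "00:10:00"  # the 10-minute default mentioned above


def parse_walltime(text):
    """Convert an "HH:MM:SS" walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def check_job_fits(worker_max, job_max=None):
    """Return the spare worker time in seconds, or raise if the job cannot fit.

    Assumption: the job is rejected when its walltime is strictly greater
    than the worker's maximum walltime.
    """
    job = parse_walltime(job_max or DEFAULT_JOB_WALLTIME)
    worker = parse_walltime(worker_max)
    if job > worker:
        raise ValueError(
            "Job cannot be run with the given max walltime worker constraint"
        )
    return worker - job
```

Under those assumptions, check_job_fits("00:05:00") raises, which matches my
5 min setting with no tc maxwalltime, while check_job_fits("02:30:00")
succeeds.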
The other few anomalies I saw I will ignore unless they happen again, as
I was using the bad 3/31 revision. These included starting a new
service with some strange default max time ("01:41:00" or 101 minutes)
after the initial services were started with the correct time, and some
strange error retry behavior.
Bear with me - these things are very difficult and tedious to report.
>> I was trying to gather that along with several
>> other anomalies in another report.
>
> Now, the oddity below is that both coaster services are started with the
> same service id. Not only that, the same service id was used for
> subsequent runs (the bootstrap logs contain multiple "runs"). This,
> roughly, makes no sense, but I can't imagine it being cause for
> goodness.
OK. Any chance I messed up copying log files (and duplicated one) or are
you seeing the duplicate service id in truly distinct logs?
(No need for reply - I'm assuming if there was a chance I duplicated a
log it would be obvious...)
>
>>
>> On 4/7/09 12:09 AM, Michael Wilde wrote:
>>> com$ cat abe+qb.xml
>>> <config>
>>>
>>> <pool handle="abe" >
>>>
>>> <profile namespace="globus" key="project">TG-CDA070002T</profile>
>>> <profile namespace="globus" key="coastersPerNode">8</profile>
>>> <profile namespace="globus"
>>> key="coasterWorkerMaxwalltime">02:30:00</profile>
>>>
>>> <execution provider="coaster" url="grid-abe.ncsa.teragrid.org"
>>> jobManager="gt2:gt2:pbs" />
>>> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
>>> <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
>>>
>>> </pool>
>>>
>>> <pool handle="qb" >
>>>
>>> <profile namespace="globus" key="project">TG-CDA070002T</profile>
>>> <profile namespace="globus" key="coastersPerNode">8</profile>
>>> <profile namespace="globus"
>>> key="coasterWorkerMaxwalltime">02:30:00</profile>
>>>
>>> <execution provider="coaster" url="queenbee.loni-lsu.teragrid.org"
>>> jobManager="gt2:gt2:pbs" />
>>> <gridftp url="gsiftp://qb1.loni.org"/>
>>> <workdirectory>/home/ux454325/swiftwork</workdirectory>
>>>
>>> </pool>
>>>
>>> </config>
>>> com$
>>>
>>>
>>> On 4/7/09 12:09 AM, Mihael Hategan wrote:
>>>> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
>>>>> The latest rev shows a similar failure on the surface, but I think
>>>>> different patterns in the coaster logs.
>>>>>
>>>>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped
>>>>> outfile.
>>>>>
>>>>> This time 39 of 40 jobs ran on abe, and then the workflow lingered
>>>>> and finally failed, with 39 ok, 1 failure.
>>>>>
>>>>> All the logs for this run are in
>>>>> /home/wilde/swift/lab/20090406-2330-72p9ale0
>>>>>
>>>>> Below that are dirs for the abe and qb coaster and gram logs.
>>>>> Abe had no gram log for this run.
>>>>>
>>>>> I suspect this one is worth looking at.
>>>> Indeed. Can you paste your sites file?
>>>>
>>>> There's some oddity there.
>>>>
>>>>