[Swift-devel] Overview of coaster block-allocation-version issues

Michael Wilde wilde at mcs.anl.gov
Fri Jun 19 08:31:32 CDT 2009



On 6/19/09 8:23 AM, Mihael Hategan wrote:
> On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote:
>> More thoughts on this:
>>
>> (2) is a showstopper on Ranger (and possibly elsewhere) and hence a much 
>> more important issue than (1).
>>
>> It seems like this problem merits a 2-pronged attack:
>>
>> a) reduce the overhead. Is it logging, or intrinsic to the protocol?
>>    -- is it obvious from the log what's causing the high overhead?
>>    -- is it a situation where the overhead is incurred even when
>>       jobs are not running, just queued?
> 
> Some profiling needs to be done.

Zhao or Allan, can you reproduce the problem on TeraPort or UC Teragrid, 
using a simple script and dummy app so that Mihael can readily reproduce it?

Mihael, do you want them to run with profiling and post results?

- Mike

> 
>> b) see if the service can be moved to a worker node
>>
>> Mike
>>
>>
>> On 6/18/09 4:59 PM, Michael Wilde wrote:
>>> Zhao and Allan have been testing the new coaster block-allocation 
>>> version on Ranger.
>>>
>>> They have reported some issues, and need to work with Mihael to better 
>>> characterize the errors, and try to reproduce them in a way that Mihael 
>>> can also do.
>>>
>>> From working with them, I see two more issues that should be discussed 
>>> and resolved, which I think they have not yet mentioned on the list. 
>>> Zhao will discuss at least one of these, but is swamped getting a 
>>> science run completed for the SEE project.
>>>
>>> The issues:
>>>
>>> 1) It's hard to configure the time dimensions of the allocator, and to 
>>> make it work well with Swift retry parameters.  The properties listed in 
>>> the table in the User Guide coaster section need more explanation and 
>>> examples. I think Zhao in his latest run got these working OK for the 
>>> "ampl" SEE model he's running (2000 jobs, 2 hours each).  I'll work with 
>>> him on this, but help from others is welcome.
>>>
>>> 2) Allan and Zhao got kicked off of Ranger because the Coaster service 
>>> was consuming too much time on the head node, which is also "login3". We 
>>> were impacting other users, and got a "cease and desist" order from the 
>>> Ranger sysadmin.  They have at least one anecdotal "top" snapshot from 
>>> the host that indicates the service was indeed using a lot of time (on 
>>> his 2000 job x 2 hour script).  At the same time, Zhao sees a huge 
>>> coaster (service?) log. Maybe related?
>>>
>>> Allan and Zhao, please keep updates flowing to swift-devel with the list 
>>> and status of coaster issues (ideally bugzilla'ed when appropriate), and 
>>> work with Mihael to capture the logs and test cases he needs to see for 
>>> each problem.  Can you both work together to make a list, and with 
>>> Mihael to decide which items need to be tracked as bugs?
>>>
>>> Thanks,
>>>
>>> Mike
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


