[Swift-devel] Overview of coaster block-allocation-version issues
Mihael Hategan
hategan at mcs.anl.gov
Fri Jun 19 08:23:28 CDT 2009
On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote:
> More thoughts on this:
>
> (2) is a showstopper on Ranger (and possibly elsewhere) and hence a much
> more important issue than (1).
>
> It seems like this problem merits a 2-pronged attack:
>
> a) reduce the overhead. Is it logging, or intrinsic to the protocol?
> -- is it obvious from the log what's causing the high overhead?
> -- is it a situation where the overhead is incurred even when
> jobs are not running, just queued?
Some profiling needs to be done.
> b) see if the service can be moved to a worker node
>
> Mike
>
>
> On 6/18/09 4:59 PM, Michael Wilde wrote:
> > Zhao and Allan have been testing the new coaster block-allocation
> > version on Ranger.
> >
> > They have reported some issues, and need to work with Mihael to better
> > characterize the errors, and try to reproduce them in a way that Mihael
> > can also do.
> >
> > From working with them, I see two more issues that should be discussed
> > and resolved, which I think they have not yet mentioned on the list.
> > Zhao will discuss at least one of these, but is swamped getting a
> > science run completed for the SEE project.
> >
> > The issues:
> >
> > 1) It's hard to configure the time dimensions of the allocator, and to
> > make it work well with Swift retry parameters. The properties listed in
> > the table in the User Guide coaster section need more explanation and
> > examples. I think Zhao in his latest run got these working OK for the
> > "ampl" SEE model he's running (2000 jobs, 2 hours each). I'll work with
> > him on this, but help from others is welcome.
> >
> > 2) Allan and Zhao got kicked off of Ranger because the Coaster service
> > was consuming too much time on the head node, which is also "login3". We
> > were impacting other users, and got a "cease and desist" order from the
> > Ranger sysadmin. They have at least one anecdotal "top" snapshot from
> > the host that indicates the service was indeed using a lot of time (on
> > his 2000 job x 2 hour script). At the same time, Zhao sees a huge
> > coaster (service?) log. Maybe related?
> >
> > Allan and Zhao, please keep updates flowing to swift-devel with the list
> > and status of coaster issues (ideally bugzilla'ed when appropriate), and
> > work with Mihael to capture the logs and test cases he needs to see for
> > each problem. Can you both work together to make a list, and with
> > Mihael to decide which items need to be tracked as bugs?
> >
> > Thanks,
> >
> > Mike
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel