[Swift-devel] Overview of coaster block-allocation-version issues
Michael Wilde
wilde at mcs.anl.gov
Thu Jun 18 16:59:02 CDT 2009
Zhao and Allan have been testing the new coaster block-allocation
version on Ranger.
They have reported some issues, and need to work with Mihael to better
characterize the errors and to reproduce them in a way that Mihael can
also run.
From working with them, I see two more issues that should be discussed
and resolved, which I think they have not yet mentioned on the list.
Zhao will discuss at least one of these, but is swamped getting a
science run completed for the SEE project.
The issues:
1) It's hard to configure the time dimensions of the allocator, and to
make it work well with Swift retry parameters. The properties listed in
the table in the User Guide coaster section need more explanation and
examples. I think Zhao in his latest run got these working OK for the
"ampl" SEE model he's running (2000 jobs, 2 hours each). I'll work with
him on this, but help from others is welcome.
2) Allan and Zhao got kicked off of Ranger because the Coaster service
was consuming too much time on the head node, which is also "login3". We
were impacting other users, and got a "cease and desist" order from the
Ranger sysadmin. They have at least one anecdotal "top" snapshot from
the host that indicates the service was indeed using a lot of CPU time (on
Zhao's 2000 job x 2 hour script). At the same time, Zhao sees a huge
coaster (service?) log. Maybe related?
Allan and Zhao, please keep updates flowing to swift-devel with the list
and status of coaster issues (ideally bugzilla'ed when appropriate), and
work with Mihael to capture the logs and test cases he needs to see for
each problem. Can you both work together to make a list, and with
Mihael to decide which items need to be tracked as bugs?
Thanks,
Mike