[Swift-devel] Overview of coaster block-allocation-version issues

Michael Wilde wilde at mcs.anl.gov
Thu Jun 18 16:59:02 CDT 2009


Zhao and Allan have been testing the new coaster block-allocation 
version on Ranger.

They have reported some issues, and need to work with Mihael to better 
characterize the errors and to reproduce them in a way that Mihael can 
repeat.

From working with them, I see two more issues that should be discussed 
and resolved, which I think they have not yet mentioned on the list. 
Zhao will discuss at least one of these, but is swamped getting a 
science run completed for the SEE project.

The issues:

1) It's hard to configure the time dimensions of the allocator, and to 
make it work well with Swift retry parameters.  The properties listed in 
the table in the User Guide coaster section need more explanation and 
examples. I think Zhao in his latest run got these working OK for the 
"ampl" SEE model he's running (2000 jobs, 2 hours each).  I'll work with 
him on this, but help from others is welcome.

2) Allan and Zhao got kicked off of Ranger because the Coaster service 
was consuming too much time on the head node, which is also "login3". We 
were impacting other users, and got a "cease and desist" order from the 
Ranger sysadmin.  They have at least one anecdotal "top" snapshot from 
the host that indicates the service was indeed using a lot of time (on 
Zhao's 2000-job x 2-hour script).  At the same time, Zhao sees a huge 
coaster (service?) log. Maybe related?

Allan and Zhao, please keep updates flowing to swift-devel with the list 
and status of coaster issues (ideally bugzilla'ed when appropriate), and 
work with Mihael to capture the logs and test cases he needs to see for 
each problem.  Can you both work together to make a list, and with 
Mihael to decide which items need to be tracked as bugs?

Thanks,

Mike


