[Swift-devel] coaster status report

Michael Wilde wilde at mcs.anl.gov
Sat Apr 4 17:03:55 CDT 2009


small clarification here -

we had to turn away from Ranger because the queue was gruesome.

the 3-failure issue was on Abe. not much to say till we find and examine 
the log on this one.

- Mike


On 4/4/09 4:59 PM, Michael Wilde wrote:
> With OOPS Glen was able to get some promising runs queued on Ranger, 
> using the default properties and the sites setting from the SEM runs.
> 
> Looking great so far, and above all was very easy to get it going.
> 
> That's very exciting!
> 
> One run shows a few (3 out of 100 or so) failures that were retried 
> successfully. We need to track these down, and see if it was a transient 
> app failure or something in Swift, etc.
> 
> Then we turned to Abe and Queenbee. That was amazingly easy to configure 
> and get running. Glen is scaling it up as we speak, trying for 2 sites x 
> 40 jobs x 8 cores = 640 cores between the two.
> 
> In initial small tests, though (50 parallel app() calls), it's sending 
> all jobs to Abe, none to Queenbee. We checked the usual sites and tc 
> things; it *seems* OK there. Possibly either a bug or a scheduler anomaly?
> We'll try with more jobs and see; will send logs and the sites etc. files if 
> that anomaly persists at larger scales.
> 
> Seems like both these sites have WS-GRAM enabled; we'd like to try that 
> as well, to expand beyond the suggested limit of 40 jobs per site. We would 
> like to get 1000 cores active on this problem: 2 x 60 x 8 or so.
> 
> Then will add in a few more fruitful TG sites.
> 
> Towards this end, Mihael, if you have the urge to probe at a 
> setting/config that lets us start coasters in 4-8 node batches, this 
> would be a great time to try that. I suspect you don't know yet if that 
> will be easy, hard, or in between?
> 
> Another note on coaster boot:
> 
> - Old problems on Abe with funky limitations on non-login shells seem 
> to have gone away, either from the latest coaster strategy (-l issues?) 
> or from Abe changes.
> 
> - On Queenbee, the initial run got this error:
> 
>     Could not start coaster service
> Caused by:
>     Task ended before registration was received.
> STDOUT: Warning: -jar not understood. Ignoring.
> Exception in thread "main" java.lang.NoClassDefFoundError: .tmp.bootstrap.y10420
>    at gnu.gcj.runtime.FirstThread.run() (/usr/lib64/libgcj.so.5.0.0)
> 
> Turns out the default Java was 1.4.2-something.
> 
> We added @default to .soft to get Java 1.6.
> Then coasters bootstrapped fine. It was nice to see that such a simple 
> workaround did the job!
> 
> At any rate, very productive, very promising, very pleasing to use.
> 
> Nice work!
> 
> - Mike
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


