[Swift-devel] coaster status report
Michael Wilde
wilde at mcs.anl.gov
Sat Apr 4 17:03:55 CDT 2009
Small clarification here:
We had to turn away from Ranger because the queue was gruesome.
The 3-failure issue was on Abe. Not much to say until we find and examine
the log for that run.
- Mike
On 4/4/09 4:59 PM, Michael Wilde wrote:
> With OOPS Glen was able to get some promising runs queued on Ranger,
> using the default properties and the sites setting from the SEM runs.
>
> Looking great so far, and above all was very easy to get it going.
>
> That's very exciting!
>
> One run shows a few (3 out of 100 or so) failures that were retried
> successfully. We need to track these down and see whether it was a
> transient app failure or something in Swift.
>
> Then we turned to Abe and Queenbee. That was amazingly easy to configure
> and get running. Glen is scaling it up as we speak, trying for 2 sites x
> 40 jobs x 8 cores = 640 cores between the two.
>
> In initial small tests, though (50 parallel app() calls), it's sending
> all jobs to Abe and none to Queenbee. We checked the usual sites and tc
> things, and it *seems* OK there. Possibly either a bug or a scheduler
> anomaly? We'll try with more jobs and see; will send logs and the sites
> etc. files if that anomaly persists at larger scales.
>
> Seems like both these sites have WS-GRAM enabled; we'd like to try that
> as well, to expand beyond the 40-job per site suggested limit. Would
> like to get 1000 cores active on this problem. 2 x 60 x 8 or so.
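
For reference, the core counts quoted above work out as in the sketch below. This is only back-of-envelope arithmetic written in Python; the function name `total_cores` is made up for illustration, and the inputs are the sites x jobs x cores-per-job figures from the message.

```python
def total_cores(sites: int, jobs_per_site: int, cores_per_job: int) -> int:
    """Cores in use if every site runs its full quota of jobs."""
    return sites * jobs_per_site * cores_per_job


# Figures taken from the message: 2 sites, 8 cores per job.
current_target = total_cores(2, 40, 8)  # the 40-jobs-per-site run
stretch_target = total_cores(2, 60, 8)  # the "2 x 60 x 8 or so" goal

print(current_target)
print(stretch_target)
```

At 60 jobs per site the total is 960 cores, a little short of 1000; 63 jobs per site would be the first count to pass it (2 x 63 x 8 = 1008).
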
>
> Then will add in a few more fruitful TG sites.
>
> Towards this end, Mihael, if you have the urge to probe at a
> setting/config that lets us start coasters in 4-8 node batches, this
> would be a great time to try that. I suspect you don't know yet whether
> that will be easy, hard, or in between?
>
> Another note on coaster boot:
>
> - old problems on Abe with funky limitations on non-login shells seem
> to have gone away, either from the latest coaster strategy (-l issues?)
> or from Abe changes.
>
> - on queenbee, initial run got this error:
>
> Could not start coaster service
> Caused by:
> Task ended before registration was received.
> STDOUT: Warning: -jar not understood. Ignoring.
> Exception in thread "main" java.lang.NoClassDefFoundError:
> .tmp.bootstrap.y10420
> at gnu.gcj.runtime.FirstThread.run() (/usr/lib64/libgcj.so.5.0.0)
>
> Turns out the default Java was 1.4.2-something.
>
> We added @default to .soft to get Java 1.6.
> Then coasters bootstrapped fine. This was nice to see, that a simple
> workaround was easy!
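
A too-old default Java like the one above could be caught before a run is launched. The following is a minimal Python sketch, not part of Swift or coasters: the helper names are hypothetical, the 1.6 minimum is taken from the workaround described in the message, and it assumes the banner printed by `java -version` contains a quoted version string such as "1.4.2" or "1.6.0_12".

```python
import re
import subprocess


def parse_java_version(banner: str):
    """Return (major, minor) from a `java -version` banner, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2) or 0)


def is_new_enough(banner: str, minimum=(1, 6)) -> bool:
    """True if the banner reports a version at or above `minimum`."""
    version = parse_java_version(banner)
    return version is not None and version >= minimum


def local_java_banner() -> str:
    """Run `java -version`; the banner is normally written to stderr."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True
    )
    return result.stderr + result.stdout


# Checks against hand-written banners (assumed formats, not captured logs).
print(is_new_enough('java version "1.4.2"'))
print(is_new_enough('java version "1.6.0_12"'))
```

Running `is_new_enough(local_java_banner())` on a login node would give the same yes/no answer for whatever `java` is first on the PATH there.
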
>
> At any rate, very productive, very promising, very pleasing to use.
>
> Nice work!
>
> - Mike