[Swift-devel] Problems running coaster
Mihael Hategan
hategan at mcs.anl.gov
Tue Jul 29 13:57:12 CDT 2008
On Tue, 2008-07-29 at 13:23 -0500, Michael Wilde wrote:
> Another possibility is the /dev/random delay in generating an id due to
> lack of server entropy. Now *that* would explain things, as it's right
> where the delay is occurring:
>
>     private void startWorker(int maxWallTime, Task prototype)
>             throws InvalidServiceContactException {
>         int id = sr.nextInt(); // <<<<<<<<<<<<<<<<<<<<<<
>         if (logger.isInfoEnabled()) {
>             logger.info("Starting worker with id=" + id
>                 + " and maxwalltime=" + maxWallTime + "s");
>         }
>
> where sr is created with SecureRandom.getInstance("SHA1PRNG")
>
> This just occurred to me and is perhaps a more likely explanation. Is
> this the same scenario that was causing the Swift client to encounter
> long delays as it started trivial workflows? How was that eventually fixed?
Hmm. Yes. I'll change the bootstrap class to start the service
with /dev/urandom instead (if available).
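
For reference, the usual JVM-level form of that switch (a sketch of the
likely change, not the actual bootstrap code) is to point the Sun
provider's seed source at /dev/urandom when launching the service:

    java -Djava.security.egd=file:/dev/./urandom ...

(The /./ is deliberate: on Sun JDKs of this vintage the plain value
file:/dev/urandom is special-cased and the seed generator still ends up
reading /dev/random.)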
>
> I can stub this out with a simple number generator and test. And/or time
> SecureRandom in a standalone program.
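>
> A minimal standalone timer (a sketch; the class name SRTimer is
> illustrative) would be something like:
>
>     import java.security.SecureRandom;
>
>     public class SRTimer {
>         public static void main(String[] args) throws Exception {
>             long t0 = System.currentTimeMillis();
>             SecureRandom sr = SecureRandom.getInstance("SHA1PRNG");
>             int id = sr.nextInt(); // first call forces self-seeding;
>                                    // this is where /dev/random can block
>             System.out.println("id=" + id + " took "
>                 + (System.currentTimeMillis() - t0) + " ms");
>         }
>     }
>
> Milliseconds elsewhere but tens of seconds on abe's headnode would point
> squarely at entropy starvation.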
>
> - Mike
>
> On 7/29/08 12:06 AM, Michael Wilde wrote:
> > Hmmm. My debug statement didn't print, but this time the job on abe ran OK.
> >
> > Tomorrow I'll run more tests and see how stable it is there, and why my
> > logging calls never showed up.
> >
> > - Mike
> >
> >
> > On 7/28/08 11:45 PM, Michael Wilde wrote:
> >> I've moved on and put in a temporary hack: don't use -l, and instead run
> >> "~/.myetcprofile" if it exists and /etc/profile if it doesn't.
> >>
> >> .myetcprofile on abe is /etc/profile with the problematic code removed.
> >>
> >> Now abe gets past the problem and runs bootstrap.sh ok.
> >>
> >> The sequence runs OK up to the point where the service on abe's
> >> headnode receives a message to start a job.
> >>
> >> At this point, the service on abe seems to hang.
> >>
> >> Comparing to the message sequence on mercury, which works, I see this:
> >>
> >> *** mercury:
> >>
> >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2
> >> SUBMITJOB(identity=1217268111318
> >> executable=/bin/bash
> >> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h
> >> arg=shared/wrapper.sh
> >> arg=echo-myx2e6xi
> >> arg=-jobdir
> >> arg=m
> >> arg=-e
> >> arg=/bin/echo
> >> arg=-out
> >> arg=echo_s000.txt
> >> arg=-err
> >> arg=stderr.txt
> >> arg=-i
> >> arg=-d
> >> ar)
> >> [ChannelManager] DEBUG Channel multiplexer -
> >> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
> >> [ChannelManager] DEBUG Channel multiplexer - Found
> >> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
> >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2
> >> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310)
> >> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin =
> >> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310
> >> [WorkerManager] INFO Coaster Queue Processor - No suitable worker
> >> found. Attempting to start a new one.
> >> [WorkerManager] INFO Worker Manager - Got allocation request:
> >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest@151ca803
> >>
> >> [WorkerManager] INFO Worker Manager - Starting worker with
> >> id=-615912369 and maxwalltime=6060s
> >> Worker start provider: gt2
> >> Worker start JM: pbs
> >>
> >> *** abe:
> >>
> >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2
> >> SUBMITJOB(identity=1217291444315
> >> executable=/bin/bash
> >> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc
> >> arg=shared/wrapper.sh
> >> arg=echo-zc5mt6xi
> >> arg=-jobdir
> >> arg=z
> >> arg=-e
> >> arg=/bin/echo
> >> arg=-out
> >> arg=echo_s000.txt
> >> arg=-err
> >> arg=stderr.txt
> >> arg=-i
> >> arg=-d
> >> arg=
> >> ar)
> >> [ChannelManager] DEBUG Channel multiplexer -
> >> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
> >> [ChannelManager] DEBUG Channel multiplexer - Found
> >> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
> >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2
> >> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043)
> >> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin =
> >> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043
> >> [WorkerManager] INFO Coaster Queue Processor - No suitable worker
> >> found. Attempting to start a new one.
> >> [WorkerManager] INFO Worker Manager - Got allocation request:
> >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest@95cfbe
> >>
> >> [AbstractKarajanChannel] DEBUG Channel multiplexer - GSSC-null REQ<:
> >> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE
> >>
> >> ***
> >>
> >> I *think* the SHUTDOWNSERVICE message on abe is coming much later,
> >> after abe's service hangs, but I'm not sure.
> >>
> >> What it looks like to me is that what should happen on abe is
> >> this:
> >>
> >> [WorkerManager] INFO Worker Manager - Got allocation request:
> >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest@151ca803
> >>
> >> [WorkerManager] INFO Worker Manager - Starting worker with
> >> id=-615912369 and maxwalltime=6060s
> >>
> >> but on abe the "Worker Manager - Starting worker" message is never seen.
> >>
> >> Looking at WorkerManager.run(), it's hard to see how the "Starting
> >> worker" message could *not* show up right after "Got allocation
> >> request", but there must be some sequence of events that causes this.
> >>
> >> Abe is an 8-core system. Is there perhaps more opportunity for a
> >> multi-thread race or deadlock that could cause this?
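> >>
> >> One quick way to check would be a thread dump of the hung service on
> >> abe's headnode (a suggested diagnostic, not something from the logs
> >> above), e.g.
> >>
> >>     kill -QUIT <service-pid>   # thread dump goes to the JVM's stdout
> >>
> >> or jstack <service-pid> where available; that would show exactly which
> >> call the worker manager thread is blocked in.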
> >>
> >> I will insert some more debug logging and try a few more times to see
> >> if things hang in this manner every time or not.
> >>
> >> - Mike
> >>
> >> PS: client logs, with abe server-side boot logs, are on the CI net in
> >> ~wilde/coast/run11
> >>
> >>
> >>
> >> On 7/28/08 10:50 PM, Mihael Hategan wrote:
> >>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote:
> >>>> On Mon, 28 Jul 2008, Michael Wilde wrote:
> >>>>
> >>>>> So it looks like something in the job specs that is launching
> >>>>> coaster for
> >>>>> gt2:pbs is not being accepted by abe.
> >>>> ok. TeraGrid's unified account system is insufficiently unified for
> >>>> me to be able to access abe, but they are aware of that; if and when
> >>>> I am reunified, I'll try this out myself.
> >>>
> >>> Not to be cynical or anything, but that unified thing: never worked.
> >>>