[Swift-devel] Problems running coaster

Mihael Hategan hategan at mcs.anl.gov
Tue Jul 29 13:57:12 CDT 2008


On Tue, 2008-07-29 at 13:23 -0500, Michael Wilde wrote:

> Another possibility is the /dev/random delay in generating an id due ot 
> lack of server entropy. Now *that* would explain things, as its right 
> where the delay is occurring:
> 
>      private void startWorker(int maxWallTime, Task prototype)
>              throws InvalidServiceContactException {
>          int id = sr.nextInt(); // <<<<<<<<<<<<<<<<<<<<<<
>          if (logger.isInfoEnabled()) {
>              logger.info("Starting worker with id=" + id + " and
>          }
> which uses SecureRandom.getInstance("SHA1PRNG")
> 
> This just occurred to me and is perhaps a more likely explanation. Is 
> this the same scenario that was causing the Swift client to encounter 
> long delays as it started trivial workflows?  How was that eventually fixed?

Hmm. Yes. I'll change the bootstrap class to start the service
with /dev/urandom instead (if available).

> 
> I can stub this out with a simple number generator and test. And/or time 
> SecureRandom in a standalone program.
> 
> - Mike
> 
> 
> 
> 
> 
> On 7/29/08 12:06 AM, Michael Wilde wrote:
> > hmmm. my debug statement didnt print. but this time the job on abe ran ok.
> > 
> > Tomorrow I'll run more tests and see how stable it is there, and why my 
> > logging calls never showed up.
> > 
> > - Mike
> > 
> > 
> > On 7/28/08 11:45 PM, Michael Wilde wrote:
> >> Ive moved on, and put a temp hack in to not use -l and instead run 
> >> "~/.myetcprofile" if it exists and /etc/profile if it doesnt.
> >>
> >> .myetcprofile on abe is /etc/profile with the problematic code removed.
> >>
> >> Now abe gets past the problem and runs bootstrap.sh ok.
> >>
> >> The sequence runs OK up to the point where the service on abe's 
> >> headnode  receives a message to start a job.
> >>
> >> AT this point, the service on abe seems to hang.
> >>
> >> Comparing to the message sequence on mercury, which works, I see this:
> >>
> >> *** mercury:
> >>
> >> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND< 2 
> >> SUBMITJOB(identity=1217268111318
> >> executable=/bin/bash
> >> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h
> >> arg=shared/wrapper.sh
> >> arg=echo-myx2e6xi
> >> arg=-jobdir
> >> arg=m
> >> arg=-e
> >> arg=/bin/echo
> >> arg=-out
> >> arg=echo_s000.txt
> >> arg=-err
> >> arg=stderr.txt
> >> arg=-i
> >> arg=-d
> >> ar)
> >> [ChannelManager] DEBUG Channel multiplexer  -
> >> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
> >> [ChannelManager] DEBUG Channel multiplexer  - Found 
> >> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
> >> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND> 2 
> >> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310)
> >> [Replier] DEBUG Worker 1  - Replier(GSSC-null)REPL>: tag = 2, fin = 
> >> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310
> >> [WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
> >> found. Attempting to start a new one.
> >> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
> >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 
> >>
> >> [WorkerManager] INFO  Worker Manager  - Starting worker with 
> >> id=-615912369 and maxwalltime=6060s
> >> Worker start provider: gt2
> >> Worker start JM: pbs
> >>
> >> *** abe:
> >>
> >> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND< 2 
> >> SUBMITJOB(identity=1217291444315
> >> executable=/bin/bash
> >> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc
> >> arg=shared/wrapper.sh
> >> arg=echo-zc5mt6xi
> >> arg=-jobdir
> >> arg=z
> >> arg=-e
> >> arg=/bin/echo
> >> arg=-out
> >> arg=echo_s000.txt
> >> arg=-err
> >> arg=stderr.txt
> >> arg=-i
> >> arg=-d
> >> arg=
> >> ar)
> >> [ChannelManager] DEBUG Channel multiplexer  -
> >> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
> >> [ChannelManager] DEBUG Channel multiplexer  - Found 
> >> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
> >> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND> 2 
> >> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043)
> >> [Replier] DEBUG Worker 1  - Replier(GSSC-null)REPL>: tag = 2, fin = 
> >> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043
> >> [WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
> >> found. Attempting to start a new one.
> >> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
> >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe 
> >>
> >> [AbstractKarajanChannel] DEBUG Channel multiplexer  - GSSC-null REQ<: 
> >> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE
> >>
> >> ***
> >>
> >> I *think* the SHUTDOWNSERVICE message on abe is coming much later, 
> >> after abe's service hangs, but Im not sure.
> >>
> >> What it looks like to me is that what should should happen on abe is 
> >> this:
> >>
> >> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
> >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 
> >>
> >> [WorkerManager] INFO  Worker Manager  - Starting worker with 
> >> id=-615912369 and maxwalltime=6060s
> >>
> >> but on abe the "Worker Manager  - Starting worker" is never seen.
> >>
> >> Looking at WorkerManager.run() its hard to see how the "Starting 
> >> worker" message could *not* show up right after "Got allocation 
> >> request", but there must be some sequence of events that causes this.
> >>
> >> Abe is an 8-core system. Is there perhaps more opportunity for a 
> >> multi-thread race or deadlock that could cause this?
> >>
> >> I will insert some more debug logging and try a few more times to see 
> >> if  thing shang in this manner every time or not.
> >>
> >> - Mike
> >>
> >> ps client Logs with abe server side boot logs are on CI net in 
> >> ~wilde/coast/run11
> >>
> >>
> >>
> >> On 7/28/08 10:50 PM, Mihael Hategan wrote:
> >>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote:
> >>>> On Mon, 28 Jul 2008, Michael Wilde wrote:
> >>>>
> >>>>> So it looks like something in the job specs that is launching 
> >>>>> coaster for
> >>>>> gt2:pbs is not being accepted by abe.
> >>>> ok. TeraGrid's unified account system is insufficiently unified for 
> >>>> me to be able to access abe, but they are aware of that; if and when 
> >>>> I am reunified, I'll try this out myself.
> >>>
> >>> Not to be cynical or anything, but that unified thing: never worked.
> >>>
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel




More information about the Swift-devel mailing list