[Swift-devel] Problems running coaster
Mihael Hategan
hategan at mcs.anl.gov
Tue Jul 29 09:38:20 CDT 2008
There is no order issue. When the service is started, the exact list of
jars to be used is supplied, rather than "all jars in this directory".
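
Roughly speaking, the effect is that of building a class loader from an
explicit jar list, so whatever else happens to sit in the cache directory is
never seen. A sketch with made-up names (this is not the actual bootstrap
code):

    // Sketch only: load exactly the jars named at service startup, in the
    // order given, instead of scanning ~/.globus/coasters/cache.
    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.List;

    class ExplicitJarLoader {
        static ClassLoader forJars(File cacheDir, List<String> jarNames)
                throws Exception {
            URL[] urls = new URL[jarNames.size()];
            for (int i = 0; i < jarNames.size(); i++) {
                urls[i] = new File(cacheDir, jarNames.get(i)).toURI().toURL();
            }
            return new URLClassLoader(urls,
                    ExplicitJarLoader.class.getClassLoader());
        }
    }
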
On Tue, 2008-07-29 at 09:29 -0500, Michael Wilde wrote:
> I was looking into why my logger.debug statements did not print.
> I am not sure, but I suspect that the updated jar, loaded into
> ~/.globus/coasters/cache, was either not placed in the classpath at
> runtime or was placed after the older copy in the same directory.
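>
> A hypothetical check I could drop into WorkerManager - not something in the
> code today, just an illustration - would print the jar the class was actually
> loaded from, using the class's existing logger:
>
>     // hypothetical debugging aid: report which cached jar this class came from
>     java.net.URL src = WorkerManager.class.getProtectionDomain()
>             .getCodeSource().getLocation();
>     logger.info("WorkerManager loaded from " + src);
>
> That would settle the classpath question directly.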
>
> I have not yet found the logic by which newer classes get loaded to the
> server, but suspect there may be an issue here. (Or, as usual, pilot
> error on my part).
>
> The class with the updated logging was WorkerManager:
>
> [wilde@honest3 cache]$ jar tvf cog-provider-coaster-0.1-a82e2ac11a74fedfadb9a8168a08b6d5.jar | grep WorkerManager
>    869 Mon Jul 28 19:10:34 CDT 2008 org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager$AllocationRequest.class
>  15556 Mon Jul 28 19:10:34 CDT 2008 org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.class
> [wilde@honest3 cache]$ jar tvf cog-provider-coaster-0.1-d903eecc754a2c97fb5ceaebdce6ccad.jar | grep WorkerManager
>    869 Mon Jul 28 23:54:24 CDT 2008 org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager$AllocationRequest.class
>  15963 Mon Jul 28 23:54:24 CDT 2008 org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.class
> [wilde@honest3 cache]$
>
> The *ad.jar file has the correct updated class; the *d5.jar file has the
> original unmodified class.
>
> --
>
> If my suspicion about the classpath order is correct, then it is more
> likely that there is a race in the job-launching code of WorkerManager,
> since it would mean that the same code hung once and worked once (I'll
> test more on abe to investigate).
>
> - Mike
>
>
>
> On 7/29/08 12:06 AM, Michael Wilde wrote:
> > Hmmm. My debug statement didn't print, but this time the job on abe ran OK.
> >
> > Tomorrow I'll run more tests and see how stable it is there, and why my
> > logging calls never showed up.
> >
> > - Mike
> >
> >
> > On 7/28/08 11:45 PM, Michael Wilde wrote:
> >> I've moved on and put in a temporary hack to not use -l, and instead run
> >> "~/.myetcprofile" if it exists and /etc/profile if it doesn't.
> >>
> >> .myetcprofile on abe is /etc/profile with the problematic code removed.
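> >>
> >> Roughly, the hack does the following (a sketch of the idea only; the
> >> actual change and the command string here are not verbatim):
> >>
> >>     // instead of "/bin/bash -l bootstrap.sh", source a user-trimmed profile
> >>     // when one exists, falling back to /etc/profile otherwise
> >>     String profile = "if [ -r \"$HOME/.myetcprofile\" ]; then"
> >>             + " . \"$HOME/.myetcprofile\"; else . /etc/profile; fi";
> >>     String cmd = "/bin/bash -c '" + profile + " ; ./bootstrap.sh'";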
> >>
> >> Now abe gets past the problem and runs bootstrap.sh ok.
> >>
> >> The sequence runs OK up to the point where the service on abe's
> >> headnode receives a message to start a job.
> >>
> >> At this point, the service on abe seems to hang.
> >>
> >> Comparing to the message sequence on mercury, which works, I see this:
> >>
> >> *** mercury:
> >>
> >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2
> >> SUBMITJOB(identity=1217268111318
> >> executable=/bin/bash
> >> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h
> >> arg=shared/wrapper.sh
> >> arg=echo-myx2e6xi
> >> arg=-jobdir
> >> arg=m
> >> arg=-e
> >> arg=/bin/echo
> >> arg=-out
> >> arg=echo_s000.txt
> >> arg=-err
> >> arg=stderr.txt
> >> arg=-i
> >> arg=-d
> >> ar)
> >> [ChannelManager] DEBUG Channel multiplexer -
> >> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
> >> [ChannelManager] DEBUG Channel multiplexer - Found
> >> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
> >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2
> >> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310)
> >> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin =
> >> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310
> >> [WorkerManager] INFO Coaster Queue Processor - No suitable worker
> >> found. Attempting to start a new one.
> >> [WorkerManager] INFO Worker Manager - Got allocation request:
> >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest@151ca803
> >>
> >> [WorkerManager] INFO Worker Manager - Starting worker with
> >> id=-615912369 and maxwalltime=6060s
> >> Worker start provider: gt2
> >> Worker start JM: pbs
> >>
> >> *** abe:
> >>
> >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2
> >> SUBMITJOB(identity=1217291444315
> >> executable=/bin/bash
> >> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc
> >> arg=shared/wrapper.sh
> >> arg=echo-zc5mt6xi
> >> arg=-jobdir
> >> arg=z
> >> arg=-e
> >> arg=/bin/echo
> >> arg=-out
> >> arg=echo_s000.txt
> >> arg=-err
> >> arg=stderr.txt
> >> arg=-i
> >> arg=-d
> >> arg=
> >> ar)
> >> [ChannelManager] DEBUG Channel multiplexer -
> >> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
> >> [ChannelManager] DEBUG Channel multiplexer - Found
> >> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
> >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2
> >> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043)
> >> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin =
> >> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043
> >> [WorkerManager] INFO Coaster Queue Processor - No suitable worker
> >> found. Attempting to start a new one.
> >> [WorkerManager] INFO Worker Manager - Got allocation request:
> >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest@95cfbe
> >>
> >> [AbstractKarajanChannel] DEBUG Channel multiplexer - GSSC-null REQ<:
> >> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE
> >>
> >> ***
> >>
> >> I *think* the SHUTDOWNSERVICE message on abe is coming much later,
> >> after abe's service hangs, but I'm not sure.
> >>
> >> What it looks like to me is that what should happen on abe is this:
> >>
> >> [WorkerManager] INFO Worker Manager - Got allocation request:
> >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest@151ca803
> >>
> >> [WorkerManager] INFO Worker Manager - Starting worker with
> >> id=-615912369 and maxwalltime=6060s
> >>
> >> but on abe the "Worker Manager - Starting worker" is never seen.
> >>
> >> Looking at WorkerManager.run(), it's hard to see how the "Starting
> >> worker" message could *not* show up right after "Got allocation
> >> request", but there must be some sequence of events that causes this.
> >>
> >> Abe is an 8-core system. Is there perhaps more opportunity for a
> >> multi-thread race or deadlock that could cause this?
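> >>
> >> To make concrete the kind of bug I mean, here is a minimal sketch
> >> (made-up names, not the actual WorkerManager code) of a
> >> missed-notification race: if a request is queued between the emptiness
> >> check and the wait(), the processing thread blocks even though work is
> >> pending, and "Starting worker" never gets logged:
> >>
> >>     import java.util.LinkedList;
> >>
> >>     class AllocationQueue {
> >>         private final LinkedList<Object> requests = new LinkedList<Object>();
> >>
> >>         // producer side: called right after "Got allocation request"
> >>         public void add(Object req) {
> >>             synchronized (requests) {
> >>                 requests.add(req);
> >>                 requests.notify();   // lost if nobody is waiting yet
> >>             }
> >>         }
> >>
> >>         // consumer side: roughly the shape of a queue-processing run() loop
> >>         public void run() throws InterruptedException {
> >>             while (true) {
> >>                 if (requests.isEmpty()) {     // BUG: checked outside the lock
> >>                     synchronized (requests) {
> >>                         requests.wait();      // can sleep past an add() that
> >>                     }                         // ran between check and wait
> >>                 }
> >>                 Object req;
> >>                 synchronized (requests) {
> >>                     req = requests.poll();
> >>                 }
> >>                 if (req != null) {
> >>                     // "Starting worker with id=..." would be logged here
> >>                 }
> >>             }
> >>         }
> >>     }
> >>
> >> In the real code both the check and the wait may well be under the same
> >> lock; this is only to illustrate the class of ordering problem I want
> >> to rule out.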
> >>
> >> I will insert some more debug logging and try a few more times to see
> >> whether things hang in this manner every time or not.
> >>
> >> - Mike
> >>
> >> PS: client logs, along with abe server-side boot logs, are on the CI
> >> net in ~wilde/coast/run11
> >>
> >>
> >>
> >> On 7/28/08 10:50 PM, Mihael Hategan wrote:
> >>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote:
> >>>> On Mon, 28 Jul 2008, Michael Wilde wrote:
> >>>>
> >>>>> So it looks like something in the job spec used to launch coasters
> >>>>> for gt2:pbs is not being accepted by abe.
> >>>> ok. TeraGrid's unified account system is insufficiently unified for
> >>>> me to be able to access abe, but they are aware of that; if and when
> >>>> I am reunified, I'll try this out myself.
> >>>
> >>> Not to be cynical or anything, but that unified thing: never worked.
> >>>