[Swift-devel] Problems running coaster

Tue Jul 29 09:47:17 CDT 2008

On Tue, 2008-07-29 at 09:38 -0500, Michael Wilde wrote:
> What are some other possibilities of why the logging code didn't work?
> 
> I see the logger.debug calls in the .class file. The logger calls were 
> mostly unconditional. Possibly a different code path, but less likely.

I think the issue here is that the remote log4j doesn't exist or is
different. It's something I've been meaning to deal with.

> 
> I will try clearing the cache and re-running.

I don't think that will help much. The odds of you having found a
collision in MD5 are fairly low.
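
For anyone who wants to rule the cache out anyway, here is a rough sketch
of a check. It assumes the 32 hex digits at the end of each cached jar name
are the MD5 of the jar's contents (suggested by the file names, but not
confirmed anywhere in this thread) and that md5sum is installed on the
remote host. check_jar_md5 is a made-up helper name, not part of
Swift or CoG.

```shell
# Hypothetical helper: compare the hex suffix in a cached jar's file name
# with the MD5 of the file contents.
# Assumes names of the form <name>-<md5>.jar and that md5sum is available.
check_jar_md5() {
    jar_path="$1"
    base=$(basename "$jar_path" .jar)
    # Everything after the last '-' is taken to be the hash in the name.
    name_hash="${base##*-}"
    file_hash=$(md5sum "$jar_path" | cut -d' ' -f1)
    if [ "$name_hash" = "$file_hash" ]; then
        echo "match"
    else
        echo "MISMATCH"
    fi
}
```

Running it over each jar in ~/.globus/coasters/cache would show whether
any cached file disagrees with its own name.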

> 
> - Mike
> 
> 
> On 7/29/08 9:38 AM, Mihael Hategan wrote:
> > There is no order issue. When the service is started the exact list of
> > jars to be used is supplied rather than "all jars in this directory".
> > 
> > On Tue, 2008-07-29 at 09:29 -0500, Michael Wilde wrote:
> >> I was looking into why my logger.debug statements did not print.
> >> I am not sure, but suspect, that the updated jar, loaded into 
> >> ~/.globus/coasters/cache, was either not placed in the classpath at 
> >> runtime or was placed after the older copy in the same directory.
> >>
> >> I have not yet found the logic by which newer classes get loaded to the 
> >> server, but suspect there may be an issue here. (Or, as usual, pilot 
> >> error on my part).
> >>
> >> The class with the updated logging was WorkerManager:
> >>
> >> [wilde@honest3 cache]$ jar tvf 
> >> cog-provider-coaster-0.1-a82e2ac11a74fedfadb9a8168a08b6d5.jar | grep 
> >> WorkerManager
> >>     869 Mon Jul 28 19:10:34 CDT 2008 
> >> org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager$AllocationRequest.class
> >>   15556 Mon Jul 28 19:10:34 CDT 2008 
> >> org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.class
> >> [wilde@honest3 cache]$ jar tvf 
> >> cog-provider-coaster-0.1-d903eecc754a2c97fb5ceaebdce6ccad.jar | grep 
> >> WorkerManager
> >>     869 Mon Jul 28 23:54:24 CDT 2008 
> >> org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager$AllocationRequest.class
> >>   15963 Mon Jul 28 23:54:24 CDT 2008 
> >> org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.class
> >> [wilde@honest3 cache]$
> >>
> >> The *ad.jar file has the correct updated class; the *d5.jar file has the 
> >> original unmodified class.
> >>
> >> --
> >>
> >> If my suspicion about the classpath order is correct, then there is a 
> >> greater possibility of a race in the job launching code of 
> >> WorkerManager, as this means that the same code hung once and worked 
> >> once (I'll test more on abe to investigate).
> >>
> >> - Mike
> >>
> >>
> >>
> >> On 7/29/08 12:06 AM, Michael Wilde wrote:
> >>> Hmmm. My debug statement didn't print, but this time the job on abe ran ok.
> >>>
> >>> Tomorrow I'll run more tests and see how stable it is there, and why my 
> >>> logging calls never showed up.
> >>>
> >>> - Mike
> >>>
> >>>
> >>> On 7/28/08 11:45 PM, Michael Wilde wrote:
> >>>> I've moved on, and put a temp hack in to not use -l and instead run 
> >>>> "~/.myetcprofile" if it exists and /etc/profile if it doesn't.
> >>>>
> >>>> .myetcprofile on abe is /etc/profile with the problematic code removed.
> >>>>
> >>>> Now abe gets past the problem and runs bootstrap.sh ok.
> >>>>
> >>>> The sequence runs OK up to the point where the service on abe's 
> >>>> headnode receives a message to start a job.
> >>>>
> >>>> At this point, the service on abe seems to hang.
> >>>>
> >>>> Comparing to the message sequence on mercury, which works, I see this:
> >>>>
> >>>> *** mercury:
> >>>>
> >>>> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND< 2 
> >>>> SUBMITJOB(identity=1217268111318
> >>>> executable=/bin/bash
> >>>> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h
> >>>> arg=shared/wrapper.sh
> >>>> arg=echo-myx2e6xi
> >>>> arg=-jobdir
> >>>> arg=m
> >>>> arg=-e
> >>>> arg=/bin/echo
> >>>> arg=-out
> >>>> arg=echo_s000.txt
> >>>> arg=-err
> >>>> arg=stderr.txt
> >>>> arg=-i
> >>>> arg=-d
> >>>> ar)
> >>>> [ChannelManager] DEBUG Channel multiplexer  -
> >>>> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
> >>>> [ChannelManager] DEBUG Channel multiplexer  - Found 
> >>>> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
> >>>> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND> 2 
> >>>> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310)
> >>>> [Replier] DEBUG Worker 1  - Replier(GSSC-null)REPL>: tag = 2, fin = 
> >>>> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310
> >>>> [WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
> >>>> found. Attempting to start a new one.
> >>>> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
> >>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest@151ca803 
> >>>>
> >>>> [WorkerManager] INFO  Worker Manager  - Starting worker with 
> >>>> id=-615912369 and maxwalltime=6060s
> >>>> Worker start provider: gt2
> >>>> Worker start JM: pbs
> >>>>
> >>>> *** abe:
> >>>>
> >>>> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND< 2 
> >>>> SUBMITJOB(identity=1217291444315
> >>>> executable=/bin/bash
> >>>> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc
> >>>> arg=shared/wrapper.sh
> >>>> arg=echo-zc5mt6xi
> >>>> arg=-jobdir
> >>>> arg=z
> >>>> arg=-e
> >>>> arg=/bin/echo
> >>>> arg=-out
> >>>> arg=echo_s000.txt
> >>>> arg=-err
> >>>> arg=stderr.txt
> >>>> arg=-i
> >>>> arg=-d
> >>>> arg=
> >>>> ar)
> >>>> [ChannelManager] DEBUG Channel multiplexer  -
> >>>> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
> >>>> [ChannelManager] DEBUG Channel multiplexer  - Found 
> >>>> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
> >>>> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND> 2 
> >>>> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043)
> >>>> [Replier] DEBUG Worker 1  - Replier(GSSC-null)REPL>: tag = 2, fin = 
> >>>> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043
> >>>> [WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
> >>>> found. Attempting to start a new one.
> >>>> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
> >>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest@95cfbe 
> >>>>
> >>>> [AbstractKarajanChannel] DEBUG Channel multiplexer  - GSSC-null REQ<: 
> >>>> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE
> >>>>
> >>>> ***
> >>>>
> >>>> I *think* the SHUTDOWNSERVICE message on abe is coming much later, 
> >>>> after abe's service hangs, but I'm not sure.
> >>>>
> >>>> What it looks like to me is that what should happen on abe is 
> >>>> this:
> >>>>
> >>>> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
> >>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest@151ca803 
> >>>>
> >>>> [WorkerManager] INFO  Worker Manager  - Starting worker with 
> >>>> id=-615912369 and maxwalltime=6060s
> >>>>
> >>>> but on abe the "Worker Manager  - Starting worker" is never seen.
> >>>>
> >>>> Looking at WorkerManager.run(), it's hard to see how the "Starting 
> >>>> worker" message could *not* show up right after "Got allocation 
> >>>> request", but there must be some sequence of events that causes this.
> >>>>
> >>>> Abe is an 8-core system. Is there perhaps more opportunity for a 
> >>>> multi-thread race or deadlock that could cause this?
> >>>>
> >>>> I will insert some more debug logging and try a few more times to see 
> >>>> if things hang in this manner every time or not.
> >>>>
> >>>> - Mike
> >>>>
> >>>> PS: client logs with abe server-side boot logs are on CI net in 
> >>>> ~wilde/coast/run11
> >>>>
> >>>>
> >>>>
> >>>> On 7/28/08 10:50 PM, Mihael Hategan wrote:
> >>>>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote:
> >>>>>> On Mon, 28 Jul 2008, Michael Wilde wrote:
> >>>>>>
> >>>>>>> So it looks like something in the job specs that is launching 
> >>>>>>> coaster for
> >>>>>>> gt2:pbs is not being accepted by abe.
> >>>>>> ok. TeraGrid's unified account system is insufficiently unified for 
> >>>>>> me to be able to access abe, but they are aware of that; if and when 
> >>>>>> I am reunified, I'll try this out myself.
> >>>>> Not to be cynical or anything, but that unified thing: never worked.
> >>>>>
> >>>> _______________________________________________
> >>>> Swift-devel mailing list
> >>>> Swift-devel at ci.uchicago.edu
> >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >