[Swift-devel] Problems running coaster

Michael Wilde wilde at mcs.anl.gov
Mon Jul 28 23:45:45 CDT 2008


I've moved on and put in a temporary hack: instead of using -l, run 
"~/.myetcprofile" if it exists and /etc/profile if it doesn't.

.myetcprofile on abe is /etc/profile with the problematic code removed.

Now abe gets past the problem and runs bootstrap.sh ok.
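
For illustration only, the fallback rule described above can be sketched as follows. This is not the actual hack (which lives wherever the shell is launched without -l); the function name choose_profile and its parameters are hypothetical.

```python
import os
import tempfile


def choose_profile(home, system_profile="/etc/profile"):
    """Return the per-user override if it exists, else the system profile.

    Hypothetical helper: mirrors the rule 'use ~/.myetcprofile if it
    exists and /etc/profile if it doesn't'.
    """
    override = os.path.join(home, ".myetcprofile")
    return override if os.path.isfile(override) else system_profile


# Usage: with an empty home directory the system profile is chosen;
# once the override file exists, it takes precedence.
with tempfile.TemporaryDirectory() as home:
    before = choose_profile(home)
    open(os.path.join(home, ".myetcprofile"), "w").close()
    after = choose_profile(home)
```

The point of the override is that a site-specific /etc/profile can be bypassed per user without changing how every other site is handled.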

The sequence runs OK up to the point where the service on abe's headnode 
receives a message to start a job.

At this point, the service on abe seems to hang.

Comparing to the message sequence on mercury, which works, I see this:

*** mercury:

[RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND< 2 
SUBMITJOB(identity=1217268111318
executable=/bin/bash
directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h
arg=shared/wrapper.sh
arg=echo-myx2e6xi
arg=-jobdir
arg=m
arg=-e
arg=/bin/echo
arg=-out
arg=echo_s000.txt
arg=-err
arg=stderr.txt
arg=-i
arg=-d
ar)
[ChannelManager] DEBUG Channel multiplexer  -
Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
[ChannelManager] DEBUG Channel multiplexer  - Found 
-134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
[RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND> 2 
SUBMITJOB(urn:1217268111318-1217268128309-1217268128310)
[Replier] DEBUG Worker 1  - Replier(GSSC-null)REPL>: tag = 2, fin = 
true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310
[WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
found. Attempting to start a new one.
[WorkerManager] INFO  Worker Manager  - Got allocation request: 
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803
[WorkerManager] INFO  Worker Manager  - Starting worker with 
id=-615912369 and maxwalltime=6060s
Worker start provider: gt2
Worker start JM: pbs

*** abe:

[RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND< 2 
SUBMITJOB(identity=1217291444315
executable=/bin/bash
directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc
arg=shared/wrapper.sh
arg=echo-zc5mt6xi
arg=-jobdir
arg=z
arg=-e
arg=/bin/echo
arg=-out
arg=echo_s000.txt
arg=-err
arg=stderr.txt
arg=-i
arg=-d
arg=
ar)
[ChannelManager] DEBUG Channel multiplexer  -
Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
[ChannelManager] DEBUG Channel multiplexer  - Found 
17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
[RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND> 2 
SUBMITJOB(urn:1217291444315-1217291458042-1217291458043)
[Replier] DEBUG Worker 1  - Replier(GSSC-null)REPL>: tag = 2, fin = 
true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043
[WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
found. Attempting to start a new one.
[WorkerManager] INFO  Worker Manager  - Got allocation request: 
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe
[AbstractKarajanChannel] DEBUG Channel multiplexer  - GSSC-null REQ<: 
tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE

***

I *think* the SHUTDOWNSERVICE message on abe is coming much later, after 
abe's service hangs, but I'm not sure.

What it looks like to me is that what should happen on abe is this:

[WorkerManager] INFO  Worker Manager  - Got allocation request: 
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803
[WorkerManager] INFO  Worker Manager  - Starting worker with 
id=-615912369 and maxwalltime=6060s

but on abe the "Worker Manager  - Starting worker" is never seen.
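
The comparison above was done by eye; a rough sketch of automating it is below. It assumes each log record starts with "[Component] LEVEL thread  - message", as in the excerpts, and skips wrapped continuation lines. The names events and first_divergence are made up for this sketch, and the sample text is copied from the two excerpts.

```python
import re

# Start of a log record as it appears in the excerpts above, e.g.
# "[WorkerManager] INFO  Worker Manager  - Got allocation request:"
RECORD = re.compile(r"^\[(\w+)\]\s+([A-Z]+)\s+(.*?)\s+-\s*(.*)$")


def events(log_text, words=3):
    """Return one (component, message prefix) pair per log record.

    Lines that do not start a record (wrapped text, job arguments) are
    skipped. Only the first few words of each message are kept, so that
    run-specific ids do not show up as differences.
    """
    out = []
    for line in log_text.splitlines():
        m = RECORD.match(line.strip())
        if m:
            out.append((m.group(1), " ".join(m.group(4).split()[:words])))
    return out


def first_divergence(a, b):
    """Return the index of the first differing event, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


MERCURY = """\
[WorkerManager] INFO  Coaster Queue Processor  - No suitable worker
found. Attempting to start a new one.
[WorkerManager] INFO  Worker Manager  - Got allocation request:
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803
[WorkerManager] INFO  Worker Manager  - Starting worker with
id=-615912369 and maxwalltime=6060s
"""

ABE = """\
[WorkerManager] INFO  Coaster Queue Processor  - No suitable worker
found. Attempting to start a new one.
[WorkerManager] INFO  Worker Manager  - Got allocation request:
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe
[AbstractKarajanChannel] DEBUG Channel multiplexer  - GSSC-null REQ<:
tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE
"""

mercury_events = events(MERCURY)
abe_events = events(ABE)
where = first_divergence(mercury_events, abe_events)
print(where, mercury_events[where], abe_events[where])
```

On these excerpts the first difference is the record right after "Got allocation request": mercury logs "Starting worker with" while abe's next record is the SHUTDOWNSERVICE request.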

Looking at WorkerManager.run(), it's hard to see how the "Starting worker" 
message could *not* show up right after "Got allocation request", but 
there must be some sequence of events that causes this.

Abe is an 8-core system. Is there perhaps more opportunity for a 
multi-thread race or deadlock that could cause this?

I will insert some more debug logging and try a few more times to see 
whether things hang in this manner every time or not.

- Mike

PS: client logs, along with abe server-side boot logs, are on the CI net in 
~wilde/coast/run11



On 7/28/08 10:50 PM, Mihael Hategan wrote:
> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote:
>> On Mon, 28 Jul 2008, Michael Wilde wrote:
>>
>>> So it looks like something in the job specs that is launching coaster for
>>> gt2:pbs is not being accepted by abe.
>> ok. TeraGrid's unified account system is insufficiently unified for me to 
>> be able to access abe, but they are aware of that; if and when I am 
>> reunified, I'll try this out myself.
> 
> Not to be cynical or anything, but that unified thing: never worked.
> 


