[Swift-devel] Problems running coaster

Michael Wilde wilde at mcs.anl.gov
Tue Jul 29 00:06:42 CDT 2008


Hmmm. My debug statement didn't print, but this time the job on abe ran OK.

Tomorrow I'll run more tests and see how stable it is there, and why my 
logging calls never showed up.

- Mike


On 7/28/08 11:45 PM, Michael Wilde wrote:
> I've moved on and put in a temporary hack: instead of using -l, run 
> "~/.myetcprofile" if it exists and /etc/profile if it doesn't.
> 
> .myetcprofile on abe is /etc/profile with the problematic code removed.
> 
> Now abe gets past the problem and runs bootstrap.sh ok.
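
[The workaround described above amounts to a conditional source at bootstrap time. A minimal sketch, assuming the hook runs in the bootstrap shell; the actual bootstrap.sh change isn't shown in this thread:]

```shell
# Workaround sketch: source a user-private copy of /etc/profile
# (with the problematic code removed) when one exists, falling back
# to the system file otherwise.  Replaces the effect of "bash -l".
if [ -f "$HOME/.myetcprofile" ]; then
    . "$HOME/.myetcprofile"
else
    . /etc/profile
fi
```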
> 
> The sequence runs OK up to the point where the service on abe's headnode 
>  receives a message to start a job.
> 
> At this point, the service on abe seems to hang.
> 
> Comparing to the message sequence on mercury, which works, I see this:
> 
> *** mercury:
> 
> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND< 2 
> SUBMITJOB(identity=1217268111318
> executable=/bin/bash
> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h
> arg=shared/wrapper.sh
> arg=echo-myx2e6xi
> arg=-jobdir
> arg=m
> arg=-e
> arg=/bin/echo
> arg=-out
> arg=echo_s000.txt
> arg=-err
> arg=stderr.txt
> arg=-i
> arg=-d
> ar)
> [ChannelManager] DEBUG Channel multiplexer  -
> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
> [ChannelManager] DEBUG Channel multiplexer  - Found 
> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND> 2 
> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310)
> [Replier] DEBUG Worker 1  - Replier(GSSC-null)REPL>: tag = 2, fin = 
> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310
> [WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
> found. Attempting to start a new one.
> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 
> 
> [WorkerManager] INFO  Worker Manager  - Starting worker with 
> id=-615912369 and maxwalltime=6060s
> Worker start provider: gt2
> Worker start JM: pbs
> 
> *** abe:
> 
> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND< 2 
> SUBMITJOB(identity=1217291444315
> executable=/bin/bash
> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc
> arg=shared/wrapper.sh
> arg=echo-zc5mt6xi
> arg=-jobdir
> arg=z
> arg=-e
> arg=/bin/echo
> arg=-out
> arg=echo_s000.txt
> arg=-err
> arg=stderr.txt
> arg=-i
> arg=-d
> arg=
> ar)
> [ChannelManager] DEBUG Channel multiplexer  -
> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
> [ChannelManager] DEBUG Channel multiplexer  - Found 
> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND> 2 
> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043)
> [Replier] DEBUG Worker 1  - Replier(GSSC-null)REPL>: tag = 2, fin = 
> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043
> [WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
> found. Attempting to start a new one.
> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe 
> 
> [AbstractKarajanChannel] DEBUG Channel multiplexer  - GSSC-null REQ<: 
> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE
> 
> ***
> 
> I *think* the SHUTDOWNSERVICE message on abe is coming much later, after 
> abe's service hangs, but I'm not sure.
> 
> What it looks like to me is that what should happen on abe is this:
> 
> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 
> 
> [WorkerManager] INFO  Worker Manager  - Starting worker with 
> id=-615912369 and maxwalltime=6060s
> 
> but on abe the "Worker Manager  - Starting worker" is never seen.
> 
> Looking at WorkerManager.run(), it's hard to see how the "Starting worker" 
> message could *not* show up right after "Got allocation request", but 
> there must be some sequence of events that causes this.
> 
> Abe is an 8-core system. Is there perhaps more opportunity for a 
> multi-thread race or deadlock that could cause this?
> 
> I will insert some more debug logging and try a few more times to see if 
> things hang in this manner every time or not.
> 
> - Mike
> 
> PS: client logs with abe server-side boot logs are on the CI net in 
> ~wilde/coast/run11
> 
> 
> 
> On 7/28/08 10:50 PM, Mihael Hategan wrote:
>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote:
>>> On Mon, 28 Jul 2008, Michael Wilde wrote:
>>>
>>>> So it looks like something in the job specs that is launching 
>>>> coaster for
>>>> gt2:pbs is not being accepted by abe.
>>> ok. TeraGrid's unified account system is insufficiently unified for 
>>> me to be able to access abe, but they are aware of that; if and when 
>>> I am reunified, I'll try this out myself.
>>
>> Not to be cynical or anything, but that unified thing: never worked.
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
