[Swift-devel] Problems running coaster
Michael Wilde
wilde at mcs.anl.gov
Tue Jul 29 13:23:01 CDT 2008
Some more details on this issue, which is different from what I previously
thought.
Short summary: do we have a /dev/random entropy problem here?
Details:
After running many more 1-job tests, I see that they are all working on
Abe. What caused the behavior that I reported below, where the coaster
service *seems* to hang, is in fact a long delay: it looks like roughly
5 minutes between the two messages in the sequence below:
--
[WorkerManager] INFO Coaster Queue Processor - No suitable worker
found. Attempting to start a new one.
[WorkerManager] INFO Worker Manager - Got allocation request:
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe
<about 5 minute gap here>
[WorkerManager] INFO Worker Manager - Starting worker with
id=1391818162 and maxwalltime=6060s
Worker start provider: gt2
Worker start JM: pbs
--
(Timestamps in the coaster-boot*.log files would be useful).
It seems to be about 5 minutes, as I get 5-6 of these messages on
stdout/err on the client side:
Progress: Executing:1
In those 5 minutes, it doesn't *seem* that the job to start a worker has
been sent to the server, as far as qstat can tell.
That's what led me last night to think that the server was hung here.
One possibility is that the impression above is falsely created by
buffering and other time lags. I am looking at the log via tail -f, so
if the message "[WorkerManager] INFO Worker Manager - Starting worker"
is buffered, that would give a misleading impression that there was a
delay. This could be compounded by a lag in qstat reporting the job's
existence, which I've never seen on other PBS systems, but I have seen
curious delays in Abe's qstat reporting job state changes.
Another possibility is the /dev/random delay in generating an id due to
lack of server entropy. Now *that* would explain things, as it's right
where the delay is occurring:
private void startWorker(int maxWallTime, Task prototype)
        throws InvalidServiceContactException {
    int id = sr.nextInt(); // <<<<<<<<<<<<<<<<<<<<<<
    if (logger.isInfoEnabled()) {
        logger.info("Starting worker with id=" + id
                + " and maxwalltime=" + maxWallTime + "s");
    }
which uses SecureRandom.getInstance("SHA1PRNG").
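If that is what's happening, the block would be in the PRNG's lazy
self-seeding rather than in nextInt() itself. A minimal sketch of the two
workarounds I have in mind (assuming the stock Sun SHA1PRNG provider; the
class and method names below are hypothetical stand-ins, not the actual
coaster code):

import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Random;

// Hypothetical sketch, not the real WorkerManager code.
class WorkerIdSource {

    // Option 1: seed the SHA1PRNG explicitly before its first use.
    // The Sun SHA1PRNG only self-seeds (which is where /dev/random may
    // get read) if no seed has been supplied before the first nextInt().
    static SecureRandom seededSecureRandom() throws NoSuchAlgorithmException {
        SecureRandom sr = SecureRandom.getInstance("SHA1PRNG");
        sr.setSeed(System.currentTimeMillis() ^ System.nanoTime());
        return sr;
    }

    // Option 2: stub the id generation out with java.util.Random; the
    // worker id only needs to be reasonably unique, not crypto-strength.
    private static final Random plain = new Random();

    static int nextWorkerId() {
        return plain.nextInt();
    }
}

There is also the JVM-level route of pointing java.security.egd (or
securerandom.source in java.security) at /dev/urandom, if the seed
generator does turn out to be the culprit.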
This just occurred to me and is perhaps a more likely explanation. Is
this the same scenario that was causing the Swift client to encounter
long delays as it started trivial workflows? How was that eventually fixed?
I can stub this out with a simple number generator and test, and/or time
SecureRandom in a standalone program.
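Something like this for the standalone timing test (a rough sketch; the
entropy_avail read is Linux-specific and TimeSecureRandom is just a
throwaway name):

import java.io.BufferedReader;
import java.io.FileReader;
import java.security.SecureRandom;

public class TimeSecureRandom {
    public static void main(String[] args) throws Exception {
        // Linux-specific: how much entropy the kernel pool currently has.
        BufferedReader r = new BufferedReader(
                new FileReader("/proc/sys/kernel/random/entropy_avail"));
        System.out.println("entropy_avail: " + r.readLine());
        r.close();

        // Time the first nextInt() on a fresh SHA1PRNG; that first call
        // triggers self-seeding and is the suspected blocking point.
        long t0 = System.currentTimeMillis();
        SecureRandom sr = SecureRandom.getInstance("SHA1PRNG");
        int id = sr.nextInt();
        long t1 = System.currentTimeMillis();
        System.out.println("first nextInt() = " + id
                + ", took " + (t1 - t0) + " ms");
    }
}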
- Mike
On 7/29/08 12:06 AM, Michael Wilde wrote:
> hmmm. My debug statement didn't print, but this time the job on Abe ran OK.
>
> Tomorrow I'll run more tests and see how stable it is there, and why my
> logging calls never showed up.
>
> - Mike
>
>
> On 7/28/08 11:45 PM, Michael Wilde wrote:
>> I've moved on, and put a temporary hack in to not use -l and instead run
>> "~/.myetcprofile" if it exists and /etc/profile if it doesn't.
>>
>> .myetcprofile on Abe is /etc/profile with the problematic code removed.
>>
>> Now Abe gets past the problem and runs bootstrap.sh OK.
>>
>> The sequence runs OK up to the point where the service on Abe's
>> headnode receives a message to start a job.
>>
>> At this point, the service on Abe seems to hang.
>>
>> Comparing to the message sequence on mercury, which works, I see this:
>>
>> *** mercury:
>>
>> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2
>> SUBMITJOB(identity=1217268111318
>> executable=/bin/bash
>> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h
>> arg=shared/wrapper.sh
>> arg=echo-myx2e6xi
>> arg=-jobdir
>> arg=m
>> arg=-e
>> arg=/bin/echo
>> arg=-out
>> arg=echo_s000.txt
>> arg=-err
>> arg=stderr.txt
>> arg=-i
>> arg=-d
>> ar)
>> [ChannelManager] DEBUG Channel multiplexer -
>> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
>> [ChannelManager] DEBUG Channel multiplexer - Found
>> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
>> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2
>> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310)
>> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin =
>> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310
>> [WorkerManager] INFO Coaster Queue Processor - No suitable worker
>> found. Attempting to start a new one.
>> [WorkerManager] INFO Worker Manager - Got allocation request:
>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803
>>
>> [WorkerManager] INFO Worker Manager - Starting worker with
>> id=-615912369 and maxwalltime=6060s
>> Worker start provider: gt2
>> Worker start JM: pbs
>>
>> *** abe:
>>
>> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2
>> SUBMITJOB(identity=1217291444315
>> executable=/bin/bash
>> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc
>> arg=shared/wrapper.sh
>> arg=echo-zc5mt6xi
>> arg=-jobdir
>> arg=z
>> arg=-e
>> arg=/bin/echo
>> arg=-out
>> arg=echo_s000.txt
>> arg=-err
>> arg=stderr.txt
>> arg=-i
>> arg=-d
>> arg=
>> ar)
>> [ChannelManager] DEBUG Channel multiplexer -
>> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
>> [ChannelManager] DEBUG Channel multiplexer - Found
>> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
>> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2
>> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043)
>> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin =
>> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043
>> [WorkerManager] INFO Coaster Queue Processor - No suitable worker
>> found. Attempting to start a new one.
>> [WorkerManager] INFO Worker Manager - Got allocation request:
>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe
>>
>> [AbstractKarajanChannel] DEBUG Channel multiplexer - GSSC-null REQ<:
>> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE
>>
>> ***
>>
>> I *think* the SHUTDOWNSERVICE message on Abe is coming much later,
>> after Abe's service hangs, but I'm not sure.
>>
>> What it looks like to me is that what should happen on Abe is this:
>>
>> [WorkerManager] INFO Worker Manager - Got allocation request:
>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803
>>
>> [WorkerManager] INFO Worker Manager - Starting worker with
>> id=-615912369 and maxwalltime=6060s
>>
>> but on Abe the "Worker Manager - Starting worker" message is never seen.
>>
>> Looking at WorkerManager.run(), it's hard to see how the "Starting
>> worker" message could *not* show up right after "Got allocation
>> request", but there must be some sequence of events that causes this.
>>
>> Abe is an 8-core system. Is there perhaps more opportunity for a
>> multi-thread race or deadlock that could cause this?
>>
>> I will insert some more debug logging and try a few more times to see
>> if things hang in this manner every time or not.
>>
>> - Mike
>>
>> PS: client logs, along with the Abe server-side boot logs, are on the CI
>> net in ~wilde/coast/run11
>>
>>
>>
>> On 7/28/08 10:50 PM, Mihael Hategan wrote:
>>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote:
>>>> On Mon, 28 Jul 2008, Michael Wilde wrote:
>>>>
>>>>> So it looks like something in the job specs that is launching
>>>>> coaster for
>>>>> gt2:pbs is not being accepted by abe.
>>>> ok. TeraGrid's unified account system is insufficiently unified for
>>>> me to be able to access abe, but they are aware of that; if and when
>>>> I am reunified, I'll try this out myself.
>>>
>>> Not to be cynical or anything, but that unified thing: never worked.
>>>