[Swift-devel] Problems running coaster

Michael Wilde wilde at mcs.anl.gov
Tue Jul 29 13:48:41 CDT 2008


I've confirmed the "long pause" problem with the IDGenerator class 
running on the Abe login host.

I can generate a few thousand random numbers instantly, but once the 
host runs out of entropy, the same trivial program just hangs:

// Added to a test copy of IDGenerator.java:
    public static void main(String[] args) {
        System.out.println("Hello World!");
        IDGenerator gen = new IDGenerator();
        for (int i = 0; i < 100; i++) {
            System.out.println("id: " + gen.nextInt());
        }
    }


At first:
[wilde at honest3 ~/rantest]$ java IDGenerator
Hello World!
id: -1080189798
id: -263746139
id: 947709574
...etc

Then I start another one, and it "just hangs" for several minutes.

[wilde at honest3 ~/rantest]$ time java IDGenerator | wc -l
^^^ this has been hanging for a few minutes.
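
To isolate whether the blocking is in SecureRandom's lazy self-seeding 
(IDGenerator uses SecureRandom.getInstance("SHA1PRNG"), per the snippet 
quoted below), a standalone timing test along these lines should show 
it. This is only a sketch, and TimeSecureRandom is a throwaway name:

    import java.security.SecureRandom;

    public class TimeSecureRandom {
        public static void main(String[] args) throws Exception {
            long t0 = System.currentTimeMillis();
            SecureRandom sr = SecureRandom.getInstance("SHA1PRNG");
            // SHA1PRNG self-seeds lazily on first use; if the seed is read
            // from a drained /dev/random, this first call is where the
            // hang should appear.
            int first = sr.nextInt();
            long t1 = System.currentTimeMillis();
            System.out.println("first id " + first + " took " + (t1 - t0) + " ms");
            for (int i = 0; i < 5; i++) {
                // once seeded, later values should come back immediately
                System.out.println("id: " + sr.nextInt());
            }
        }
    }

If the first nextInt() call accounts for essentially all of the wall 
time, that would pin the hang on seed acquisition rather than on 
anything in IDGenerator itself.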

I also see that genId() can generate negative numbers. That produces 
some log files with "--" in their names where others have just "-", 
which is confusing and can break automation scripts.
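
If we keep the current generator, one way to avoid negative ids (just a 
sketch of the idea, not what the code does today) would be to mask off 
the sign bit before the value is used in file names:

    // Hypothetical helper: clear the sign bit so ids are never negative,
    // avoiding "--" sequences in derived log file names. Works with
    // SecureRandom too, since it extends java.util.Random.
    public static int nextNonNegativeId(java.util.Random r) {
        return r.nextInt() & Integer.MAX_VALUE;
    }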

Should I file Bugzilla entries for either or both of these?

- Mike


On 7/29/08 1:23 PM, Michael Wilde wrote:
> Some more details on this issue, which is different from what I 
> previously thought.
> 
> Short summary: do we have a /dev/random entropy problem here?
> 
> details:
> 
> After running many more 1-job tests, I see that they are all working on 
> Abe. What caused the behavior I reported below, where the coaster 
> service *seems* to hang, is in fact a long delay: there is a gap of 
> about 5 minutes within the message sequence below:
> -- 
> [WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
> found. Attempting to start a new one.
> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe 
> 
> 
> <about 5 minute gap here>
> 
> [WorkerManager] INFO  Worker Manager  - Starting worker with 
> id=1391818162 and maxwalltime=6060s
> Worker start provider: gt2
> Worker start JM: pbs
> -- 
> (Timestamps in the coaster-boot*.log files would be useful).
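> If the boot log goes through log4j (the "[Class] LEVEL ..." line format 
> suggests it does, though I haven't checked), adding %d to the layout 
> pattern would give us timestamps. A rough sketch of doing that 
> programmatically; the class and method names here are just placeholders:
> 
>     import org.apache.log4j.ConsoleAppender;
>     import org.apache.log4j.Logger;
>     import org.apache.log4j.PatternLayout;
> 
>     public class TimestampedBootLog {
>         // Prepend an ISO8601 timestamp; the rest of the pattern mirrors
>         // the existing "[Class] LEVEL thread  - message" shape.
>         public static void enable() {
>             Logger.getRootLogger().addAppender(new ConsoleAppender(
>                     new PatternLayout("%d{ISO8601} [%c{1}] %-5p %t  - %m%n")));
>         }
>     }
> 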
> 
> The delay seems to be about 5 minutes, as I get 5-6 of these messages 
> on stdout/stderr on the client side:
> Progress:  Executing:1
> 
> In those 5 minutes, it doesn't *seem* that the job to start a worker has 
> been sent to the server, as seen by qstat.
> 
> That's what led me last night to think that the server was hung here.
> 
> One possibility is that the impression above is falsely created by 
> buffering and other time lags. I am looking at the log via tail -f, so 
> if the message "[WorkerManager] INFO  Worker Manager  - Starting worker" 
> is buffered, that would give a misleading impression that there was a 
> delay. This could be compounded by a lag in qstat reporting a job's 
> existence, which I've never seen on other PBS systems, though I have 
> seen curious delays in Abe's qstat reporting job state changes.
> 
> Another possibility is a /dev/random delay in generating an id due to a 
> lack of entropy on the server. Now *that* would explain things, as it's 
> right where the delay is occurring:
> 
>     private void startWorker(int maxWallTime, Task prototype)
>             throws InvalidServiceContactException {
>         int id = sr.nextInt(); // <<<<<<<<<<<<<<<<<<<<<<
>         if (logger.isInfoEnabled()) {
>             logger.info("Starting worker with id=" + id + " and maxwalltime="
>                     + maxWallTime + "s");
>         }
> 
> which uses SecureRandom.getInstance("SHA1PRNG").
> 
> This just occurred to me and is perhaps a more likely explanation. Is 
> this the same scenario that was causing the Swift client to encounter 
> long delays as it started trivial workflows?  How was that eventually 
> fixed?
> 
> I can stub this out with a simple number generator and test, and/or 
> time SecureRandom in a standalone program.
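> 
> For the stub, something as simple as this (testing only, since it gives 
> up the unpredictability SecureRandom provides) would at least tell us 
> whether the delay disappears:
> 
>     // Hypothetical stub for testing only: java.util.Random seeded from
>     // the clock needs no entropy from /dev/random, so it cannot block.
>     import java.util.Random;
> 
>     public class StubIDGenerator {
>         private final Random rand = new Random(System.currentTimeMillis());
> 
>         public int nextInt() {
>             return rand.nextInt();
>         }
>     }
> 
> If the stub makes the delay go away, another thing worth trying is 
> starting the service JVM with -Djava.security.egd=file:/dev/./urandom, 
> which I believe points the SHA1PRNG seed source at the non-blocking 
> /dev/urandom on Sun JVMs, though I'd want to verify that.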
> 
> - Mike
> 
> 
> 
> 
> 
> On 7/29/08 12:06 AM, Michael Wilde wrote:
>> Hmmm. My debug statement didn't print, but this time the job on Abe 
>> ran OK.
>>
>> Tomorrow I'll run more tests and see how stable it is there, and why 
>> my logging calls never showed up.
>>
>> - Mike
>>
>>
>> On 7/28/08 11:45 PM, Michael Wilde wrote:
>>> Ive moved on, and put a temp hack in to not use -l and instead run 
>>> "~/.myetcprofile" if it exists and /etc/profile if it doesnt.
>>>
>>> .myetcprofile on abe is /etc/profile with the problematic code removed.
>>>
>>> Now Abe gets past the problem and runs bootstrap.sh OK.
>>>
>>> The sequence runs OK up to the point where the service on Abe's 
>>> headnode receives a message to start a job.
>>>
>>> At this point, the service on Abe seems to hang.
>>>
>>> Comparing to the message sequence on mercury, which works, I see this:
>>>
>>> *** mercury:
>>>
>>> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND< 2 
>>> SUBMITJOB(identity=1217268111318
>>> executable=/bin/bash
>>> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h
>>> arg=shared/wrapper.sh
>>> arg=echo-myx2e6xi
>>> arg=-jobdir
>>> arg=m
>>> arg=-e
>>> arg=/bin/echo
>>> arg=-out
>>> arg=echo_s000.txt
>>> arg=-err
>>> arg=stderr.txt
>>> arg=-i
>>> arg=-d
>>> ar)
>>> [ChannelManager] DEBUG Channel multiplexer  -
>>> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
>>> [ChannelManager] DEBUG Channel multiplexer  - Found 
>>> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
>>> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND> 2 
>>> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310)
>>> [Replier] DEBUG Worker 1  - Replier(GSSC-null)REPL>: tag = 2, fin = 
>>> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310
>>> [WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
>>> found. Attempting to start a new one.
>>> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 
>>>
>>> [WorkerManager] INFO  Worker Manager  - Starting worker with 
>>> id=-615912369 and maxwalltime=6060s
>>> Worker start provider: gt2
>>> Worker start JM: pbs
>>>
>>> *** abe:
>>>
>>> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND< 2 
>>> SUBMITJOB(identity=1217291444315
>>> executable=/bin/bash
>>> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc
>>> arg=shared/wrapper.sh
>>> arg=echo-zc5mt6xi
>>> arg=-jobdir
>>> arg=z
>>> arg=-e
>>> arg=/bin/echo
>>> arg=-out
>>> arg=echo_s000.txt
>>> arg=-err
>>> arg=stderr.txt
>>> arg=-i
>>> arg=-d
>>> arg=
>>> ar)
>>> [ChannelManager] DEBUG Channel multiplexer  -
>>> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
>>> [ChannelManager] DEBUG Channel multiplexer  - Found 
>>> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
>>> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND> 2 
>>> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043)
>>> [Replier] DEBUG Worker 1  - Replier(GSSC-null)REPL>: tag = 2, fin = 
>>> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043
>>> [WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
>>> found. Attempting to start a new one.
>>> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe 
>>>
>>> [AbstractKarajanChannel] DEBUG Channel multiplexer  - GSSC-null REQ<: 
>>> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE
>>>
>>> ***
>>>
>>> I *think* the SHUTDOWNSERVICE message on Abe is coming much later, 
>>> after Abe's service hangs, but I'm not sure.
>>>
>>> What it looks like to me is that what should happen on Abe is this:
>>>
>>> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 
>>>
>>> [WorkerManager] INFO  Worker Manager  - Starting worker with 
>>> id=-615912369 and maxwalltime=6060s
>>>
>>> but on Abe the "Worker Manager  - Starting worker" message is never seen.
>>>
>>> Looking at WorkerManager.run(), it's hard to see how the "Starting 
>>> worker" message could *not* show up right after "Got allocation 
>>> request", but there must be some sequence of events that causes this.
>>>
>>> Abe is an 8-core system. Is there perhaps more opportunity for a 
>>> multi-thread race or deadlock that could cause this?
>>>
>>> I will insert some more debug logging and try a few more times to see 
>>> whether things hang in this manner every time or not.
>>>
>>> - Mike
>>>
>>> PS: Client logs, along with the Abe server-side boot logs, are on the 
>>> CI network in ~wilde/coast/run11
>>>
>>>
>>>
>>> On 7/28/08 10:50 PM, Mihael Hategan wrote:
>>>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote:
>>>>> On Mon, 28 Jul 2008, Michael Wilde wrote:
>>>>>
>>>>>> So it looks like something in the job specs that is launching 
>>>>>> coaster for
>>>>>> gt2:pbs is not being accepted by abe.
>>>>> ok. TeraGrid's unified account system is insufficiently unified for 
>>>>> me to be able to access abe, but they are aware of that; if and 
>>>>> when I am reunified, I'll try this out myself.
>>>>
>>>> Not to be cynical or anything, but that unified thing: never worked.
>>>>
> 


