[Swift-devel] Coaster CPU-time consumption issue

Michael Wilde wilde at mcs.anl.gov
Mon Jul 13 18:00:33 CDT 2009


OK. Best to test with Mihael's cog r2429.
I hope that this ends the latest head-node havoc :)

Please post either way, so we know if the other problem remains or not.

Thanks,

- Mike


On 7/13/09 5:41 PM, skenny at uchicago.edu wrote:
> cool, i'll give this a shot now too. 
> 
> it's possible the other error i mentioned to you, mike, was
> actually related to the stdout redirection. i wanted to test
> more, but i'm trying not to wreak havoc on the headnode :P anyway,
> if this works, i can do more testing and will post if i'm
> still getting the error.
> 
> ~sk
> 
> ---- Original message ----
>> Date: Mon, 13 Jul 2009 17:17:26 -0500
>> From: Michael Wilde <wilde at mcs.anl.gov>  
>> Subject: Re: [Swift-devel] Coaster CPU-time consumption issue  
>> To: Allan Espinosa <aespinosa at cs.uchicago.edu>, Sarah Kenny <skenny at uchicago.edu>
>> Cc: swift-devel <swift-devel at ci.uchicago.edu>
>>
>> Nice!  Now let's beat it up and see how well it works.
>>
>> Sarah: Allan did not encounter the error messages you mentioned to me.
>> I suggest you do this:
>>
>> - post to the devel list the messages you got
>>
>> - test this patch to see if it clears up the problem
>>
>> Mike
>>
>>
>> On 7/13/09 5:04 PM, Allan Espinosa wrote:
>>> hi,
>>>
>>> here is a patch which fixes the cpu usage issue in the bootstrap coaster
>>> service:
>>> http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch
>>>
>>> suggested svn log entry:
>>>     Added locks via wait() and notify() to prevent busy waiting /
>>>     active polling in the block task queue.
>>>
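>>> in spirit, the patch turns the consumer's busy-poll of the queue into a
>>> blocking wait. a minimal sketch of the fixed dequeue path only
>>> (hypothetical names, not the actual coaster code; see Mike's notes
>>> further down):
>>>
>>>     // "queue" is assumed to be a java.util.LinkedList field that
>>>     // doubles as the monitor; callers handle InterruptedException.
>>>     Object take() throws InterruptedException {
>>>         synchronized (queue) {
>>>             while (queue.isEmpty()) {
>>>                 queue.wait(); // sleeps without burning CPU; lock released
>>>             }
>>>             return queue.removeFirst();
>>>         }
>>>     }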
>>>
>>> Test: 2000 touch jobs using 066-many.swift via local:local :
>>> before: http://www.ci.uchicago.edu/~aespinosa/swift/run06
>>> after: http://www.ci.uchicago.edu/~aespinosa/swift/run07
>>>
>>> CPU usage drops from 100% to 0%, with a few 25-40% spikes!
>>>
>>> -Allan
>>>
>>>
>>> 2009/7/13 Michael Wilde <wilde at mcs.anl.gov>:
>>>> Hi Allan,
>>>>
>>>> I think the methods you want for synchronization are part of class
>>>> Object. They are documented in the chapter "Threads and Locks" of The
>>>> Java Language Specification:
>>>>
>>>> http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8
>>>>
>>>> queue.wait() should be called if the queue is empty.
>>>>
>>>> queue.notify() or .notifyAll() should be called when something is added
>>>> to the queue. I think notify() should work.
>>>>
>>>> .wait() will, I think, take a timeout, but I suspect you don't need that.
>>>>
>>>> Both should be called within the synchronized(queue) constructs that are
>>>> already in the code.
>>>>
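>>>> Putting that together, a minimal sketch of the whole pattern
>>>> (hypothetical names, not the actual block task queue code):
>>>>
>>>>     import java.util.LinkedList;
>>>>
>>>>     public class BlockingTaskQueue {
>>>>         private final LinkedList queue = new LinkedList();
>>>>
>>>>         public void put(Object task) {
>>>>             synchronized (queue) {
>>>>                 queue.addLast(task);
>>>>                 queue.notify(); // wake one waiting consumer
>>>>             }
>>>>         }
>>>>
>>>>         public Object take() throws InterruptedException {
>>>>             synchronized (queue) {
>>>>                 // loop, not if: guards against spurious wakeups
>>>>                 while (queue.isEmpty()) {
>>>>                     queue.wait(); // or queue.wait(timeoutMillis)
>>>>                 }
>>>>                 return queue.removeFirst();
>>>>             }
>>>>         }
>>>>     }
>>>>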
>>>> Should be fun to fix this!
>>>>
>>>> - Mike
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 7/13/09 2:12 PM, Allan Espinosa wrote:
>>>>> 97% is an average, as can be seen in run06. swift version is r3005 and
>>>>> cogkit r2410. this is a vanilla build of swift.
>>>>>
>>>>> 2009/7/13 Mihael Hategan <hategan at mcs.anl.gov>:
>>>>>> A while ago I committed a patch to run the service process with a
>>>>>> lower priority. Is that in use?
>>>>>>
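>>>>>> (For reference, one way a bootstrap process could spawn the service
>>>>>> JVM at a lower priority; this is an illustration with assumed names,
>>>>>> not necessarily what the committed patch does:
>>>>>>
>>>>>>     // javaExe and serviceJar are hypothetical variables;
>>>>>>     // pb.start() throws IOException, handling omitted here.
>>>>>>     ProcessBuilder pb = new ProcessBuilder(
>>>>>>         "nice", "-n", "10", javaExe, "-Xmx128M", "-jar", serviceJar);
>>>>>>     pb.redirectErrorStream(true);
>>>>>>     Process service = pb.start();
>>>>>>
>>>>>> The "NI 10" on the hot java PID in the top snapshot below is
>>>>>> consistent with such a renice.)
>>>>>>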
>>>>>> Also, is logging reduced or is it the default?
>>>>>>
>>>>>> Is the 97% CPU usage a spike, or does it stay there on average?
>>>>>>
>>>>>> Can I take a look at the coaster logs from skenny's run on ranger?
>>>>>>
>>>>>> I'd also like to point out, as inoffensively as I can, that I'm
>>>>>> working 100% on I2U2, and my lack of getting more than lightly
>>>>>> involved in this is a consequence of that.
>>>>>>
>>>>>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
>>>>>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From
>>>>>>> here, process 22395 is the child of the main java process
>>>>>>> (bootstrap.jar) and is loading the CPU.
>>>>>>>
>>>>>>> I have coasters.log, worker-*log, swift logs, and gram logs in
>>>>>>> ~aespinosa/workflows/activelog/run06. This refers to a different run;
>>>>>>> PID 15206 is the child java process of bootstrap.jar in here.
>>>>>>> top snapshot:
>>>>>>> top - 13:49:03 up 55 days,  1:45,  1 user,  load average: 1.18, 0.80, 0.55
>>>>>>> Tasks: 121 total,   1 running, 120 sleeping,   0 stopped,   0 zombie
>>>>>>> Cpu(s):  7.5%us,  2.8%sy, 48.7%ni, 41.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>>>> Mem:   4058916k total,  3889864k used,   169052k free,   239688k buffers
>>>>>>> Swap:  4192956k total,       96k used,  4192860k free,  2504812k cached
>>>>>>>
>>>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>>>> 22395 aespinos  25  10  525m  91m  13m S 97.5  2.3   4:29.22 java
>>>>>>> 22217 aespinos  15   0 10736 1048  776 R  0.3  0.0   0:00.50 top
>>>>>>> 22243 aespinos  16   0  102m 5576 3536 S  0.3  0.1   0:00.10 globus-job-mana
>>>>>>> 14764 aespinos  15   0 98024 1744  976 S  0.0  0.0   0:00.06 sshd
>>>>>>> 14765 aespinos  15   0 65364 2796 1176 S  0.0  0.1   0:00.18 bash
>>>>>>> 22326 aespinos  18   0  8916 1052  852 S  0.0  0.0   0:00.00 bash
>>>>>>> 22328 aespinos  19   0  8916 1116  908 S  0.0  0.0   0:00.00 bash
>>>>>>> 22364 aespinos  15   0 1222m  18m 8976 S  0.0  0.5   0:00.20 java
>>>>>>> 22444 aespinos  16   0  102m 5684 3528 S  0.0  0.1   0:00.09 globus-job-man
>>>>>>>
>>>>>>> ps snapshot:
>>>>>>>
>>>>>>> 22328 ?        S      0:00  \_ /bin/bash
>>>>>>> 22364 ?        Sl     0:00      \_ /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>>>>>>>         -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
>>>>>>>         -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>>>>>         -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>>>>>>>         -jar /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
>>>>>>>         https://128.135.125.17:46519 11505253269
>>>>>>> 22395 ?        SNl    6:29          \_ /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
>>>>>>>         -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>>>>>         -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>>>>>>>         -Djava.security.egd=file:///dev/urandom -cp
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec946b9584af140.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_service-afc177c2e02fd0e773698f3d478a33ef.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:
>>>>>>>         /home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc
>>>>>>>
>>>>>>> 2009/7/13 Mihael Hategan <hategan at mcs.anl.gov>:
>>>>>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
>>>>>>>>>>> At the time we did not have a chance to gather detailed
>>>>>>>>>>> evidence, but I was surprised by two things:
>>>>>>>>>>>
>>>>>>>>>>> - that there were two Java processes and that one was so big.
>>>>>>>>>>> (Most likely the active process was just a child thread of the
>>>>>>>>>>> main process?)
>>>>>>>>>> One java process is the bootstrap process (it downloads the
>>>>>>>>>> coaster jars, sets up the environment, and runs the coaster
>>>>>>>>>> service). It has always been like this. Did you happen to capture
>>>>>>>>>> the output of ps to a file? That would be useful, because from
>>>>>>>>>> what you are suggesting, it appears that the bootstrap process is
>>>>>>>>>> eating 100% CPU. That process should only be sleeping after the
>>>>>>>>>> service is started.
>>>>>>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but
>>>>>>>>> I can't locate it.
>>>>>>>>>
>>>>>>>>> As best as I can recall, it showed the larger memory-footprint
>>>>>>>>> process to be relatively idle, and the smaller-footprint process
>>>>>>>>> (about 275MB) to be burning 100% of a CPU.
>>>>>>>> Normally, the smaller-footprint process should be the bootstrap.
>>>>>>>> But that's why I would like the ps output, because it sounds odd.
>>>>>>>>> Allan will try to get a snapshot of this shortly.
>>>>>>>>>
>>>>>>>>> If this observation is correct, what's the best way to find out
>>>>>>>>> where it's spinning? Profiling? Debug logging? Can you get
>>>>>>>>> profiling data from a JVM that doesn't exit?
>>>>>>>> Once I know where it is, I can look at the code, and then we'll go
>>>>>>>> from there.
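>>>>>>>>
>>>>>>>> On the profiling question: a thread dump will show where a live JVM
>>>>>>>> is spinning without stopping it. "kill -QUIT <pid>" prints all
>>>>>>>> thread stacks to the JVM's stdout; since JDK 5 the same is available
>>>>>>>> programmatically. A sketch only, not code that exists in coasters:
>>>>>>>>
>>>>>>>>     import java.util.Map;
>>>>>>>>
>>>>>>>>     public class StackDump {
>>>>>>>>         // print the current stack of every live thread to stderr
>>>>>>>>         public static void dump() {
>>>>>>>>             Map<Thread, StackTraceElement[]> stacks =
>>>>>>>>                 Thread.getAllStackTraces();
>>>>>>>>             for (Map.Entry<Thread, StackTraceElement[]> e :
>>>>>>>>                     stacks.entrySet()) {
>>>>>>>>                 System.err.println(e.getKey());
>>>>>>>>                 for (StackTraceElement frame : e.getValue()) {
>>>>>>>>                     System.err.println("    at " + frame);
>>>>>>>>                 }
>>>>>>>>             }
>>>>>>>>         }
>>>>>>>>     }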
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>
>>>


