[Swift-devel] Coaster CPU-time consumption issue
Michael Wilde
wilde at mcs.anl.gov
Mon Jul 13 14:11:34 CDT 2009
On 7/13/09 2:06 PM, Mihael Hategan wrote:
> A while ago I committed a patch to run the service process with a lower
> priority. Is that in use?
Looks like 22395 is running with a nice value of 10 which I think is
what you set in that patch: 22395 aespinos 25 10
>
> Also, is logging reduced or is it the default?
>
> Is the 97% CPU usage a spike, or does it stay there on average?
>
> Can I take a look at the coaster logs from skenny's run on ranger?
>
> I'd also like to point out in as little offensive mode as I can, that
> I'm working 100% on I2U2 and my lack of getting more than lightly
> involved in this is a consequence of that.
Right, understood. Any pointers you can give are welcome, and Allan and
I are expecting to do the legwork. We'll at least try to find out where
the overhead is coming from.
- Mike
>
> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg. From
>> here process 22395 is the child of the main java process
>> (bootstrap.jar) and is loading the CPU.
>>
>> I have coasters.log, worker-*log, swift logs, gram logs in
>> ~aespinosa/workflows/activelog/run06. This refers to a different run.
>> PID 15206 is the child java process of bootstrap.jar in here.
>>
>> top snapshot:
>> top - 13:49:03 up 55 days, 1:45, 1 user, load average: 1.18, 0.80, 0.55
>> Tasks: 121 total, 1 running, 120 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 7.5%us, 2.8%sy, 48.7%ni, 41.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 4058916k total, 3889864k used, 169052k free, 239688k buffers
>> Swap: 4192956k total, 96k used, 4192860k free, 2504812k cached
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 22395 aespinos 25 10 525m 91m 13m S 97.5 2.3 4:29.22 java
>> 22217 aespinos 15 0 10736 1048 776 R 0.3 0.0 0:00.50 top
>> 22243 aespinos 16 0 102m 5576 3536 S 0.3 0.1 0:00.10 globus-job-mana
>> 14764 aespinos 15 0 98024 1744 976 S 0.0 0.0 0:00.06 sshd
>> 14765 aespinos 15 0 65364 2796 1176 S 0.0 0.1 0:00.18 bash
>> 22326 aespinos 18 0 8916 1052 852 S 0.0 0.0 0:00.00 bash
>> 22328 aespinos 19 0 8916 1116 908 S 0.0 0.0 0:00.00 bash
>> 22364 aespinos 15 0 1222m 18m 8976 S 0.0 0.5 0:00.20 java
>> 22444 aespinos 16 0 102m 5684 3528 S 0.0 0.1 0:00.09 globus-job-man
>>
>> ps snapshot:
>>
>> 22328 ? S 0:00 \_ /bin/bash
>> 22364 ? Sl 0:00 \_
>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>> -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>> -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu -jar
>> /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
>> https://128.135.125.17:46519 11505253269
>> 22395 ? SNl 6:29 \_
>> /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -Xmx128M
>> -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>> -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>> -Djava.security.egd=file:///dev/urandom -cp
>> /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:/home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:/home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:/home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:/home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:/home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec94
6b9584af140.jar:/home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:/home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:/home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:/home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:/home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:/home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:/home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:/home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:/home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:/home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar:
/home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:/home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:/home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:/home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:/home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:/home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:/home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_se
rvice-afc177c2e02fd0e773698f3d478a33ef.jar:/home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:/home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:/home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:/home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:/home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:/home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:/home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:/home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:/home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:/home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc
>>
>>
>>
>> 2009/7/13 Mihael Hategan <hategan at mcs.anl.gov>:
>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
>>>>>> At the time we did not have a chance to gather detailed evidence, but I
>>>>>> was surprised by two things:
>>>>>>
>>>>>> - that there were two Java processes and that one was so big. (Are most
>>>>>> likely the active process was just a child thread of the main process?)
>>>>> One java process is the bootstrap process (it downloads the coaster
>>>>> jars, sets up the environment and runs the coaster service). It has
>>>>> always been like this. Did you happen to capture the output of ps to a
>>>>> file? That would be useful, because from what you are suggesting, it
>>>>> appears that the bootstrap process is eating 100% CPU. That process
>>>>> should only be sleeping after the service is started.
>>>> I *thought* I captured the output of "top -u sarahs'id -b -d" but I cant
>>>> locate it.
>>>>
>>>> As best as I can recall it showed the larger memory-footprint process to
>>>> be relatively idle, and the smaller footprint process (about 275MB) to
>>>> be burning 100% of a CPU.
>>> Normally, the smaller footprint process should be the bootstrap. But
>>> that's why I would like the ps output, because it sounds odd.
>>>
>>>> Allan will try to get a snapshot of this shortly.
>>>>
>>>> If this observation if correct, whats the best way to find out where its
>>>> spinning? Profiling? Debug logging? Can you get profiling data from a
>>>> JVM that doesnt exit?
>>> Once I know where it is, I can look at the code and then we'll go from
>>> there.
>>>
>>>
>>>
>>
>>
>
More information about the Swift-devel
mailing list