[Swift-devel] Coasters and std's on ranger

skenny at uchicago.edu skenny at uchicago.edu
Mon Jul 13 21:54:11 CDT 2009


so, here is the swift error i currently get running a 50-job
workflow with the latest code on ranger:

Execution failed:
        Exception in RInvoke:
Arguments: [scripts/4reg_dummy.R, matrices/4_reg/network1/gestspeech.cov, 31, 0.5, speech]
Host: RANGER
Directory: 4reg_speech-20090713-2127-tbl7ou0e/jobs/f/RInvoke-f57xpmdj
stderr.txt:

stdout.txt:

----

Caused by:
        Block task failed:
org.globus.gram.GramException: The job manager could not stage out a file
        at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
        at org.globus.gram.GramJob.setStatus(GramJob.java:184)
        at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
        at java.lang.Thread.run(Thread.java:619)

Cleaning up...
Shutting down service at https://129.114.50.163:36721
Got channel MetaChannel: 24980848 -> GSSSChannel-null(1)
- Done

gram log shows this:

7/13 21:37:58 JM: sending callback of status 4 (failure code 155) to https://128.135.125.211:50003/1247538475621.
7/13 21:37:58 JMI: testing job manager scripts for type fork exist and permissions are ok.

this is the same error i was getting on ranger running without
coasters, before i commented out the redirection of stdout and
stderr (which fixed the error for provider-gt2). is there a
similar redirection of stdout/stderr going on in provider-coaster
that could be corrected the same way?
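
(for reference, the kind of redirection i mean: if the RSL handed to
the gt2 job manager names stdout/stderr files, roughly like this
sketch -- not the exact RSL swift generates --

   &( executable = "/bin/echo" )
    ( arguments  = "hello" )
    ( stdout = "stdout.txt" )
    ( stderr = "stderr.txt" )

then the job manager stages those files back when the job finishes,
and a failure in that staging surfaces as GRAM error 155, the "could
not stage out a file" above.)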

~sk

p.s. let me know if anyone would like the swift log for this.

---- Original message ----
>Date: Mon, 13 Jul 2009 18:00:33 -0500
>From: Michael Wilde <wilde at mcs.anl.gov>  
>Subject: Re: [Swift-devel] Coaster CPU-time consumption issue  
>To: skenny at uchicago.edu
>Cc: Allan Espinosa <aespinosa at cs.uchicago.edu>, swift-devel <swift-devel at ci.uchicago.edu>
>
>OK. Best to test with Mihael's cog r2429.
>I hope that this ends the latest head-node havoc :)
>
>Please post either way, so we know if the other problem remains or not.
>
>Thanks,
>
>- Mike
>
>
>On 7/13/09 5:41 PM, skenny at uchicago.edu wrote:
>> cool, i'll give this a shot now too. 
>> 
>> it's possible the other error i mentioned to you, mike, was
>> actually related to the stdout redirection. i wanted to test
>> more, but i'm trying not to wreak havoc on the headnode :P anyway,
>> if this works, i can do more testing and will post if i'm
>> still getting the error.
>> 
>> ~sk
>> 
>> ---- Original message ----
>>> Date: Mon, 13 Jul 2009 17:17:26 -0500
>>> From: Michael Wilde <wilde at mcs.anl.gov>  
>>> Subject: Re: [Swift-devel] Coaster CPU-time consumption issue
>>> To: Allan Espinosa <aespinosa at cs.uchicago.edu>, Sarah Kenny <skenny at uchicago.edu>
>>> Cc: swift-devel <swift-devel at ci.uchicago.edu>
>>>
>>> Nice!  Now let's beat it up and see how well it works.
>>>
>>> Sarah: Allan did not encounter the error messages you mentioned to me.
>>> I suggest you do this:
>>>
>>> - post to the devel list the messages you got
>>>
>>> - test this patch to see if it clears up the problem
>>>
>>> Mike
>>>
>>>
>>> On 7/13/09 5:04 PM, Allan Espinosa wrote:
>>>> hi,
>>>>
>>>> here is a patch which solves the cpu usage on the bootstrap coaster
>>>> service:
>>>> http://www.ci.uchicago.edu/~aespinosa/provider-coaster-cpu_fix.patch
>>>> suggested svn log entry:
>>>>     Added locks via wait() and notify() to prevent busy waiting /
>>>>     active polling in the block task queue.
>>>>
>>>>
>>>> Tested 2000 touch jobs using 066-many.swift via local:local:
>>>> before: http://www.ci.uchicago.edu/~aespinosa/swift/run06
>>>> after: http://www.ci.uchicago.edu/~aespinosa/swift/run07
>>>>
>>>> CPU usage drops from 100% to 0%, with a few 25-40% spikes!
>>>>
>>>> -Allan
>>>>
>>>>
>>>> 2009/7/13 Michael Wilde <wilde at mcs.anl.gov>:
>>>>> Hi Allan,
>>>>>
>>>>> I think the methods you want for synchronization are part of class
>>>>> Object. They are documented in the chapter Threads and Locks of The
>>>>> Java Language Specification:
>>>>>
>>>>> http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.8
>>>>>
>>>>> queue.wait() should be called if the queue is empty.
>>>>>
>>>>> queue.notify() or .notifyAll() should be called when something is
>>>>> added to the queue. I think notify() should work.
>>>>>
>>>>> .wait() can, I think, take a timeout, but I suspect you don't need
>>>>> that. Both should be called within the synchronized(queue) constructs
>>>>> that are already in the code.
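>>>>>
>>>>> A minimal sketch of that pattern (illustrative names only, not the
>>>>> actual block task queue code):
>>>>>
>>>>>     import java.util.LinkedList;
>>>>>
>>>>>     public class TaskQueue {
>>>>>         private final LinkedList<Runnable> queue = new LinkedList<Runnable>();
>>>>>
>>>>>         public void enqueue(Runnable task) {
>>>>>             synchronized (queue) {
>>>>>                 queue.add(task);
>>>>>                 queue.notify();   // wake a consumer blocked in dequeue()
>>>>>             }
>>>>>         }
>>>>>
>>>>>         public Runnable dequeue() throws InterruptedException {
>>>>>             synchronized (queue) {
>>>>>                 while (queue.isEmpty()) {   // loop guards against spurious wakeups
>>>>>                     queue.wait();           // releases the lock while sleeping
>>>>>                 }
>>>>>                 return queue.removeFirst();
>>>>>             }
>>>>>         }
>>>>>     }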
>>>>>
>>>>> Should be fun to fix this!
>>>>>
>>>>> - Mike
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 7/13/09 2:12 PM, Allan Espinosa wrote:
>>>>>> 97% is an average as can be seen in run06.  swift version is r3005
>>>>>> and cogkit r2410.  this is a vanilla build of swift.
>>>>>>
>>>>>> 2009/7/13 Mihael Hategan <hategan at mcs.anl.gov>:
>>>>>>> A while ago I committed a patch to run the service process with a
>>>>>>> lower priority. Is that in use?
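>>>>>>>
>>>>>>> (if it is, the service JVM would be running at a lower priority --
>>>>>>> roughly as if it had been started with something like
>>>>>>>
>>>>>>>     nice -n 10 java -jar bootstrap.jar ...
>>>>>>>
>>>>>>> so a non-zero NI on the java process in top would show it.)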
>>>>>>>
>>>>>>> Also, is logging reduced or is it the default?
>>>>>>>
>>>>>>> Is the 97% CPU usage a spike, or does it stay there on average?
>>>>>>> Can I take a look at the coaster logs from skenny's run on ranger?
>>>>>>> I'd also like to point out, in as inoffensive a mode as I can, that
>>>>>>> I'm working 100% on I2U2, and my lack of getting more than lightly
>>>>>>> involved in this is a consequence of that.
>>>>>>>
>>>>>>> On Mon, 2009-07-13 at 13:55 -0500, Allan Espinosa wrote:
>>>>>>>> I ran 2000 "sleep 60" jobs on teraport and monitored tp-osg.  From
>>>>>>>> here, process 22395 is the child of the main java process
>>>>>>>> (bootstrap.jar) and is loading the CPU.
>>>>>>>>
>>>>>>>> I have coasters.log, worker-*.log, swift logs, and gram logs in
>>>>>>>> ~aespinosa/workflows/activelog/run06.  This refers to a different
>>>>>>>> run.  PID 15206 is the child java process of bootstrap.jar in here.
>>>>>>>> top snapshot:
>>>>>>>> top - 13:49:03 up 55 days,  1:45,  1 user,  load average: 1.18, 0.80, 0.55
>>>>>>>> Tasks: 121 total,   1 running, 120 sleeping,   0 stopped,   0 zombie
>>>>>>>> Cpu(s):  7.5%us,  2.8%sy, 48.7%ni, 41.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>>>>> Mem:   4058916k total,  3889864k used,   169052k free,   239688k buffers
>>>>>>>> Swap:  4192956k total,       96k used,  4192860k free,  2504812k cached
>>>>>>>>
>>>>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>>>>> 22395 aespinos  25  10  525m  91m  13m S 97.5  2.3   4:29.22 java
>>>>>>>> 22217 aespinos  15   0 10736 1048  776 R  0.3  0.0   0:00.50 top
>>>>>>>> 22243 aespinos  16   0  102m 5576 3536 S  0.3  0.1   0:00.10 globus-job-mana
>>>>>>>> 14764 aespinos  15   0 98024 1744  976 S  0.0  0.0   0:00.06 sshd
>>>>>>>> 14765 aespinos  15   0 65364 2796 1176 S  0.0  0.1   0:00.18 bash
>>>>>>>> 22326 aespinos  18   0  8916 1052  852 S  0.0  0.0   0:00.00 bash
>>>>>>>> 22328 aespinos  19   0  8916 1116  908 S  0.0  0.0   0:00.00 bash
>>>>>>>> 22364 aespinos  15   0 1222m  18m 8976 S  0.0  0.5   0:00.20 java
>>>>>>>> 22444 aespinos  16   0  102m 5684 3528 S  0.0  0.1   0:00.09 globus-job-man
>>>>>>>>
>>>>>>>> ps snapshot:
>>>>>>>>
>>>>>>>> 22328 ?        S      0:00  \_ /bin/bash
>>>>>>>> 22364 ?        Sl     0:00      \_ /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>>>>>>>>           -Djava=/opt/osg-ce-1.0.0-r2/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
>>>>>>>>           -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>>>>>>           -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>>>>>>>>           -jar /tmp/bootstrap.w22332 http://communicado.ci.uchicago.edu:46520
>>>>>>>>           https://128.135.125.17:46519 11505253269
>>>>>>>> 22395 ?        SNl    6:29          \_ /opt/osg-ce-1.0.0-r2/jdk1.5/bin/java
>>>>>>>>           -Xmx128M
>>>>>>>>           -DX509_USER_PROXY=/home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/22243.1247510668/x509_up
>>>>>>>>           -DGLOBUS_HOSTNAME=tp-osg.ci.uchicago.edu
>>>>>>>>           -Djava.security.egd=file:///dev/urandom -cp
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cog-provider-coaster-0.3-e10824578d296f9eebba24f209dbed7b.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/backport-util-concurrent-f9c59530e5d6ca38f3ba6c0b6213e016.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cog-abstraction-common-2.3-6f32fedfa8ec01e07d0096a5275ac24b.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cog-jglobus-dev-080222-d87a8fb09be6d8011f6492feabce475d.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cog-karajan-0.36-dev-1614e96028db0b862d84fa01e5998872.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cog-provider-gt2-2.4-b8ed4d13933b4c28a7e6844a39a4fad3.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cog-provider-gt4_0_0-2.5-658d743844fc772713ac2aa6f92b00e7.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cog-provider-local-2.2-60cc44c1599e1298376036bb9dc531c7.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cog-provider-localscheduler-0.4-c5eeab454e4fe4a99421e28761566bf1.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cog-provider-ssh-2.4-74d67b067ac196fcdec946b9584af140.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cog-util-0.92-7b1f1e2bf52a6b575948e3f8949fa1df.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cryptix-asn1-87c4cf848c81d102bd29e33681b80e8a.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cryptix-c3dad86be114c7aaf2ddf32c8e52184a.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cryptix32-59772ad239684bf10ae8fe71f4dbae22.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/concurrent-967678fe1b153be98d982e3867e7271b.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/j2ssh-core-0.2.2-patched-9bf1ffb8ab700234649f70ef4a35f029.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/j2ssh-common-0.2.2-d65a51ea6f64efc066915c1618c613ca.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/jaxrpc-8e7d80b5d77dff6ed2f41352e9147101.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/jce-jdk13-131-06fc7049669d16c4001a452e100b401f.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/jgss-9cccfd21259791b509af229a0181f207.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/log4j-1.2.8-18a4ca847248e5b8606325684342701c.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/puretls-90b9c31c201243b9f4a24fa11d404702.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/addressing-1.0-44c19ed929b7d8ab75812b7cd60753c7.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/commonj-80b93fb3333a17d66fc1afdef5a13563.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/axis-f01bcaa789cf9735430b289f6b39ea9a.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/axis-url-fffc9e2378df340c8d3ed0c029867d0d.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/cog-axis-eafb6cd78da733f3293a5508793c10a4.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/globus_delegation_service-e49623aeb2d0297615dfb7ad5a834306.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/globus_delegation_stubs-29ce051b29a9422aeba1f60ac205f1b1.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/globus_wsrf_mds_aggregator_stubs-fbcd9a33c3982fae5a4231ca8f426560.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_stubs-d09ea57f3863104dafca984682ec71ff.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/globus_wsrf_rendezvous_service-afc177c2e02fd0e773698f3d478a33ef.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/globus_wsrf_rft_stubs-e3be33b222d03afc750b112c7f638f41.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/gram-utils-ab1a282ee889d381b22051a863f086cb.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/gram-stubs-1bd6f6863c3d4c31bf5aa0dd34adf0be.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/gram-client-197210112784800e635b333acda58ee9.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/naming-resources-d7a5b4123aad30d5dc11ca827aa6177a.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/naming-common-1cfe69c9206c1f13bb328a350e3fb0e4.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/naming-factory-ddb1fb5f295162e0389d713822f1112e.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/naming-java-6f6855fb184b81d050d17a1e938cd2a2.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/saaj-fa0706bd9bcb29f522c1a08db1cbcd94.jar:
>>>>>>>>           /home/aespinosa/.globus/coasters/cache/wsdl4j-a0f571fafc
>>>>>>>>
>>>>>>>> 2009/7/13 Mihael Hategan <hategan at mcs.anl.gov>:
>>>>>>>>> On Mon, 2009-07-13 at 12:28 -0500, Michael Wilde wrote:
>>>>>>>>>>>> At the time we did not have a chance to gather detailed
>>>>>>>>>>>> evidence, but I was surprised by two things:
>>>>>>>>>>>>
>>>>>>>>>>>> - that there were two Java processes and that one was so big.
>>>>>>>>>>>> (Most likely the active process was just a child thread of the
>>>>>>>>>>>> main process?)
>>>>>>>>>>> One java process is the bootstrap process (it downloads the
>>>>>>>>>>> coaster jars, sets up the environment and runs the coaster
>>>>>>>>>>> service). It has always been like this. Did you happen to
>>>>>>>>>>> capture the output of ps to a file? That would be useful,
>>>>>>>>>>> because from what you are suggesting, it appears that the
>>>>>>>>>>> bootstrap process is eating 100% CPU. That process should only
>>>>>>>>>>> be sleeping after the service is started.
>>>>>>>>>> I *thought* I captured the output of "top -u sarahs'id -b -d"
>>>>>>>>>> but I can't locate it.
>>>>>>>>>>
>>>>>>>>>> As best as I can recall it showed the larger memory-footprint
>>>>>>>>>> process to be relatively idle, and the smaller footprint process
>>>>>>>>>> (about 275MB) to be burning 100% of a CPU.
>>>>>>>>> Normally, the smaller footprint process should be the bootstrap.
>>>>>>>>> But that's why I would like the ps output, because it sounds odd.
>>>>>>>>>>  Allan will try to get a snapshot of this shortly.
>>>>>>>>>>
>>>>>>>>>> If this observation is correct, what's the best way to find out
>>>>>>>>>> where it's spinning? Profiling? Debug logging? Can you get
>>>>>>>>>> profiling data from a JVM that doesn't exit?
>>>>>>>>> Once I know where it is, I can look at the code and then we'll
>>>>>>>>> go from there.
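>>>>>>>>>
>>>>>>>>> (one cheap way to see where it spins, without attaching a full
>>>>>>>>> profiler: send the JVM a SIGQUIT and read the thread dump it
>>>>>>>>> prints to its stdout/log -- e.g., for the pid from Allan's top
>>>>>>>>> output:
>>>>>>>>>
>>>>>>>>>     kill -QUIT 22395
>>>>>>>>>     jstack 22395      # alternative; ships with JDK 5+ on linux
>>>>>>>>>
>>>>>>>>> a few dumps taken a few seconds apart should show which thread
>>>>>>>>> is busy.)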
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>
>>>>


