[Swift-devel] Coaster urandom fix?

Michael Wilde wilde at mcs.anl.gov
Tue Jan 27 23:23:52 CST 2009


While looking into the coaster "-l" issue on teraport I noticed another 
diff in my svn tree (not checked in) for the urandom problem below.

(-Djava.security.egd=file:///dev/urandom)

Did you make that fix somewhere? In my code it was in bootstrap.sh, but 
I don't see it there as of revision 2261:

--- modules/provider-coaster/resources/bootstrap.sh     (revision 2261)
+++ modules/provider-coaster/resources/bootstrap.sh     (working copy)
@@ -11,6 +11,11 @@
         rm -f $DJ
         exit 1
  }
+if [ -f ~/.myetcprofile ]; then
+  source ~/.myetcprofile
+else
+  source /etc/profile
+fi
  if [ "$L" == "" ]; then
         L=~/coaster-boot-$ID.log
  fi
@@ -52,8 +57,13 @@
  fi
  echo "JAVA=$JAVA" >>$L
  if [ -x $JAVA ]; then
+<<<<<<< .mine
+       echo "$JAVA -Djava.home="$JAVA_HOME" 
-DX509_USER_PROXY="$X509_USER_PROXY" -DGLOBUS_HOSTNAME="$H" 
-Djava.security.egd=file:///dev/urandom -jar $DJ $BS $LMD5 $LS $ID" >>$L
+       $JAVA -Djava.home="$JAVA_HOME" 
-DGLOBUS_TCP_PORT_RANGE="$GLOBUS_TCP_PORT_RANGE" 
-DX509_USER_PROXY="$X509_USER_PROXY" -DX509_CERT_DIR="$X509_CERT_DIR" 
-DGLOBUS_HOSTNAME="$H" -Djava.security.egd=file:///dev/urandom -jar $DJ 
$BS $LMD5 $LS $ID >>$L 2>&1
+=======
         echo "$JAVA -Djava.home="$JAVA_HOME" 
-DX509_USER_PROXY="$X509_USER_PROXY" -DGLOBUS_HOSTNAME="$H" -jar $DJ $BS 
$LMD5 $LS $ID" >>$L
         $JAVA -Djava.home="$JAVA_HOME" 
-DGLOBUS_TCP_PORT_RANGE="$GLOBUS_TCP_PORT_RANGE" 
-DX509_USER_PROXY="$X509_USER_PROXY" -DX509_CERT_DIR="$X509_CERT_DIR" 
-DGLOBUS_HOSTNAME="$H" -jar $DJ $BS $LMD5 $LS $ID
+>>>>>>> .r2261
         EC=$?
         echo "EC: $EC" >>$L
         rm -f $DJ
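
For what it's worth, here is a rough sketch of the "if available" variant 
Mihael mentioned: only pass the egd option when the device is actually 
there. The egd_opt helper and the EGD_OPT variable are made-up names for 
illustration; nothing like them exists in bootstrap.sh today.

```shell
#!/bin/bash
# Hypothetical helper (not part of bootstrap.sh): print the JVM entropy
# option only when the given device exists and is readable.
egd_opt() {
    local dev="${1:-/dev/urandom}"
    if [ -r "$dev" ]; then
        # Same form of the option as in the diff above.
        echo "-Djava.security.egd=file://$dev"
    fi
}

# Empty when /dev/urandom is missing, so the JVM keeps its default source.
EGD_OPT=$(egd_opt)
echo "EGD_OPT=$EGD_OPT"
```

The java line in bootstrap.sh could then take $EGD_OPT (unquoted, so an 
empty value adds no argument) in place of the hard-coded option. I have 
not tried this on abe or teraport.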

- Mike




On 7/29/08 1:57 PM, Mihael Hategan wrote:
> On Tue, 2008-07-29 at 13:23 -0500, Michael Wilde wrote:
> 
>> Another possibility is the /dev/random delay in generating an id due to 
>> lack of server entropy. Now *that* would explain things, as it's right 
>> where the delay is occurring:
>>
>>      private void startWorker(int maxWallTime, Task prototype)
>>              throws InvalidServiceContactException {
>>          int id = sr.nextInt(); // <<<<<<<<<<<<<<<<<<<<<<
>>          if (logger.isInfoEnabled()) {
>>              logger.info("Starting worker with id=" + id + " and
>>          }
>> which uses SecureRandom.getInstance("SHA1PRNG")
>>
>> This just occurred to me and is perhaps a more likely explanation. Is 
>> this the same scenario that was causing the Swift client to encounter 
>> long delays as it started trivial workflows?  How was that eventually fixed?
> 
> Hmm. Yes. I'll change the bootstrap class to start the service
> with /dev/urandom instead (if available).
> 
>> I can stub this out with a simple number generator and test. And/or time 
>> SecureRandom in a standalone program.
>>
>> - Mike
>>
>>
>>
>>
>>
>> On 7/29/08 12:06 AM, Michael Wilde wrote:
>>> Hmmm. My debug statement didn't print, but this time the job on abe ran OK.
>>>
>>> Tomorrow I'll run more tests and see how stable it is there, and why my 
>>> logging calls never showed up.
>>>
>>> - Mike
>>>
>>>
>>> On 7/28/08 11:45 PM, Michael Wilde wrote:
>>>> I've moved on, and put a temp hack in to not use -l and instead source 
>>>> "~/.myetcprofile" if it exists and /etc/profile if it doesn't.
>>>>
>>>> .myetcprofile on abe is /etc/profile with the problematic code removed.
>>>>
>>>> Now abe gets past the problem and runs bootstrap.sh ok.
>>>>
>>>> The sequence runs OK up to the point where the service on abe's 
>>>> headnode receives a message to start a job.
>>>>
>>>> At this point, the service on abe seems to hang.
>>>>
>>>> Comparing to the message sequence on mercury, which works, I see this:
>>>>
>>>> *** mercury:
>>>>
>>>> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND< 2 
>>>> SUBMITJOB(identity=1217268111318
>>>> executable=/bin/bash
>>>> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h
>>>> arg=shared/wrapper.sh
>>>> arg=echo-myx2e6xi
>>>> arg=-jobdir
>>>> arg=m
>>>> arg=-e
>>>> arg=/bin/echo
>>>> arg=-out
>>>> arg=echo_s000.txt
>>>> arg=-err
>>>> arg=stderr.txt
>>>> arg=-i
>>>> arg=-d
>>>> ar)
>>>> [ChannelManager] DEBUG Channel multiplexer  -
>>>> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
>>>> [ChannelManager] DEBUG Channel multiplexer  - Found 
>>>> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS
>>>> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND> 2 
>>>> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310)
>>>> [Replier] DEBUG Worker 1  - Replier(GSSC-null)REPL>: tag = 2, fin = 
>>>> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310
>>>> [WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
>>>> found. Attempting to start a new one.
>>>> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
>>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest@151ca803
>>>>
>>>> [WorkerManager] INFO  Worker Manager  - Starting worker with 
>>>> id=-615912369 and maxwalltime=6060s
>>>> Worker start provider: gt2
>>>> Worker start JM: pbs
>>>>
>>>> *** abe:
>>>>
>>>> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND< 2 
>>>> SUBMITJOB(identity=1217291444315
>>>> executable=/bin/bash
>>>> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc
>>>> arg=shared/wrapper.sh
>>>> arg=echo-zc5mt6xi
>>>> arg=-jobdir
>>>> arg=z
>>>> arg=-e
>>>> arg=/bin/echo
>>>> arg=-out
>>>> arg=echo_s000.txt
>>>> arg=-err
>>>> arg=stderr.txt
>>>> arg=-i
>>>> arg=-d
>>>> arg=
>>>> ar)
>>>> [ChannelManager] DEBUG Channel multiplexer  -
>>>> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
>>>> [ChannelManager] DEBUG Channel multiplexer  - Found 
>>>> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS
>>>> [RequestHandler] DEBUG Channel multiplexer  - GSSC-null: HND> 2 
>>>> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043)
>>>> [Replier] DEBUG Worker 1  - Replier(GSSC-null)REPL>: tag = 2, fin = 
>>>> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043
>>>> [WorkerManager] INFO  Coaster Queue Processor  - No suitable worker 
>>>> found. Attempting to start a new one.
>>>> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
>>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest@95cfbe
>>>>
>>>> [AbstractKarajanChannel] DEBUG Channel multiplexer  - GSSC-null REQ<: 
>>>> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE
>>>>
>>>> ***
>>>>
>>>> I *think* the SHUTDOWNSERVICE message on abe is coming much later, 
>>>> after abe's service hangs, but I'm not sure.
>>>>
>>>> It looks to me like what should happen on abe is 
>>>> this:
>>>>
>>>> [WorkerManager] INFO  Worker Manager  - Got allocation request: 
>>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest@151ca803
>>>>
>>>> [WorkerManager] INFO  Worker Manager  - Starting worker with 
>>>> id=-615912369 and maxwalltime=6060s
>>>>
>>>> but on abe the "Worker Manager  - Starting worker" is never seen.
>>>>
>>>> Looking at WorkerManager.run(), it's hard to see how the "Starting 
>>>> worker" message could *not* show up right after "Got allocation 
>>>> request", but there must be some sequence of events that causes this.
>>>>
>>>> Abe is an 8-core system. Is there perhaps more opportunity for a 
>>>> multi-thread race or deadlock that could cause this?
>>>>
>>>> I will insert some more debug logging and try a few more times to see 
>>>> if things hang in this manner every time or not.
>>>>
>>>> - Mike
>>>>
>>>> PS: client logs with abe server-side boot logs are on the CI net in 
>>>> ~wilde/coast/run11
>>>>
>>>>
>>>>
>>>> On 7/28/08 10:50 PM, Mihael Hategan wrote:
>>>>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote:
>>>>>> On Mon, 28 Jul 2008, Michael Wilde wrote:
>>>>>>
>>>>>>> So it looks like something in the job specs that is launching 
>>>>>>> coaster for
>>>>>>> gt2:pbs is not being accepted by abe.
>>>>>> ok. TeraGrid's unified account system is insufficiently unified for 
>>>>>> me to be able to access abe, but they are aware of that; if and when 
>>>>>> I am reunified, I'll try this out myself.
>>>>> Not to be cynical or anything, but that unified thing: never worked.
>>>>>
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


