[Swift-devel] Re: Coaster problem on BG/P - worker processes dying

Justin M Wozniak wozniak at mcs.anl.gov
Thu Jul 1 11:56:13 CDT 2010


On Thu, 1 Jul 2010, Michael Wilde wrote:

> Justin, can you send a brief update to the list on the coaster problem 
> (workers exiting after a few jobs) that is blocking you on the BG/P, and 
> how you are re-working worker logging to debug it?

A paste from a previous email is below (both BG/P systems are down due to 
cooling issues today).

So far, the issue only appears after several thousand jobs run on at least 
512 nodes.

I'm pretty close to generating the logging I need to track this down.  I 
have broken down the worker logs into one log per worker script...

Paste:

Running on the Intrepid compute nodes.  In the last few runs I've only 
seen it in the 512-node case (which I think worked at least once), not 
with 256 nodes, but that could just be because the failure is rare.

2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB): handling reply timeout; sendReqTime=100618-160429.108, sendTime=100618-160429.108, now=100618-160629.117
2010-06-18 16:06:29,117-0500 INFO  Command Command(2, SUBMITJOB): re-sending
2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
         at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
         at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
         at java.util.TimerThread.mainLoop(Timer.java:537)
         at java.util.TimerThread.run(Timer.java:487)
2010-06-18 16:06:29,118-0500 INFO  Command Sending Command(2, SUBMITJOB) on MetaChannel: 855782146 -> SC-0618-370320-000000-001756
2010-06-18 16:06:29,119-0500 INFO  AbstractStreamKarajanChannel Channel IOException
java.net.SocketException: Broken pipe
         at java.net.SocketOutputStream.socketWrite0(Native Method)
         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105)
         at java.net.SocketOutputStream.write(SocketOutputStream.java:137)
         at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:292)
         at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:244)
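
As a reading aid for the first WARN entry in the paste: the gap between 
sendTime and now shows how long the service waited before declaring the 
reply timeout.  The sketch below does that arithmetic.  It assumes the 
stamps use a yyMMdd-HHmmss.SSS layout (inferred from the matching 
2010-06-18 log prefix, not confirmed against the coaster source), and the 
function names are hypothetical, not part of Swift or CoG.

```python
from datetime import datetime

# Assumed layout of the stamps in the WARN line: yyMMdd-HHmmss.SSS
STAMP_FORMAT = "%y%m%d-%H%M%S.%f"


def parse_stamp(text):
    """Parse a stamp such as '100618-160429.108' into a datetime."""
    return datetime.strptime(text, STAMP_FORMAT)


def elapsed_seconds(send_time, now):
    """Seconds between the sendTime and now fields of a timeout entry."""
    return (parse_stamp(now) - parse_stamp(send_time)).total_seconds()


# Values copied from the "handling reply timeout" entry in the paste.
waited = elapsed_seconds("100618-160429.108", "100618-160629.117")
print(f"{waited:.3f} s between send and timeout")
```

Under that assumption the command waited about two minutes for a reply 
before it was re-sent, and the re-send then hit the broken pipe one to two 
milliseconds later.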


-- 
Justin M Wozniak


