[Swift-devel] Re: Coaster problem on BG/P - worker processes dying

Mihael Hategan hategan at mcs.anl.gov
Thu Jul 1 12:08:08 CDT 2010


That is typically an indication that something went wrong with the
worker or the worker connection. It's also possible that the message
queues are loaded heavily enough that they can't process everything in
time. The coaster logs include some entries that show how loaded the
queues are.
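
As a rough way to see how long a command waited before the reply timeout
fired, the timestamps in the quoted WARN line below can be subtracted. This
is only a sketch: the helper name reply_wait_seconds is hypothetical, the
yymmdd-HHMMSS.mmm layout and the meaning of sendTime and now are inferred
from the pasted log rather than from the coaster source, and the sample
line is the wrapped log entry rejoined by hand.

```python
import re
from datetime import datetime

# Assumed timestamp layout, inferred from the pasted log: yymmdd-HHMMSS.mmm
TS_FORMAT = "%y%m%d-%H%M%S.%f"

# Matches the three timestamp fields of a "handling reply timeout" entry.
TIMEOUT_RE = re.compile(
    r"sendReqTime=(?P<req>[\d.-]+),\s*"
    r"sendTime=(?P<send>[\d.-]+),\s*"
    r"now=(?P<now>[\d.-]+)"
)


def reply_wait_seconds(line):
    """Return seconds between sendTime and now, or None if the line does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    sent = datetime.strptime(match.group("send"), TS_FORMAT)
    now = datetime.strptime(match.group("now"), TS_FORMAT)
    return (now - sent).total_seconds()


# The WARN entry from the paste, with the mail client's line wraps removed.
sample = (
    "2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB): "
    "handling reply timeout; sendReqTime=100618-160429.108, "
    "sendTime=100618-160429.108, now=100618-160629.117"
)

if __name__ == "__main__":
    print(reply_wait_seconds(sample))
```

For the entry in the paste this comes out to roughly two minutes between
the send and the timeout. Whether that matches a configured timeout value
is not something this thread confirms.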

On Thu, 2010-07-01 at 11:56 -0500, Justin M Wozniak wrote:
> On Thu, 1 Jul 2010, Michael Wilde wrote:
> 
> > Justin, can you send a brief update to the list on the coaster problem 
> > (workers exiting after a few jobs) that is blocking you on the BG/P, and 
> > how you are re-working worker logging to debug it?
> 
> A paste from a previous email is below (both BG/P systems are down due to 
> cooling issues today).
> 
> So far, the issue only appears after several thousand jobs run on at least 
> 512 nodes.
> 
> I'm pretty close to generating the logging I need to track this down.  I 
> have broken down the worker logs into one log per worker script...
> 
> Paste:
> 
> Running on the Intrepid compute nodes.  In the last few runs I've only 
> seen it in the 512 node case (I think this worked at least once), not 256 
> nodes, but that could be just because this is rare.
> 
> 2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB): handling reply timeout; sendReqTime=100618-160429.108, sendTime=100618-160429.108, now=100618-160629.117
> 2010-06-18 16:06:29,117-0500 INFO  Command Command(2, SUBMITJOB): re-sending
> 2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>          at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
>          at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
>          at java.util.TimerThread.mainLoop(Timer.java:537)
>          at java.util.TimerThread.run(Timer.java:487)
> 2010-06-18 16:06:29,118-0500 INFO  Command Sending Command(2, SUBMITJOB) on MetaChannel: 855782146 -> SC-0618-370320-000000-001756
> 2010-06-18 16:06:29,119-0500 INFO  AbstractStreamKarajanChannel Channel IOException
> java.net.SocketException: Broken pipe
>          at java.net.SocketOutputStream.socketWrite0(Native Method)
>          at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105)
>          at java.net.SocketOutputStream.write(SocketOutputStream.java:137)
>          at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:292)
>          at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:244)
> 