[Swift-devel] Re: Coaster problem on BG/P - worker processes dying
Justin M Wozniak
wozniak at mcs.anl.gov
Thu Jul 1 11:56:13 CDT 2010
On Thu, 1 Jul 2010, Michael Wilde wrote:
> Justin, can you send a brief update to the list on the coaster problem
> (workers exiting after a few jobs) that is blocking you on the BG/P, and
> how you are re-working worker logging to debug it?
A paste from a previous email is below (both BG/P systems are down due to
cooling issues today).
So far, the issue only appears after several thousand jobs run on at least
512 nodes.
I'm pretty close to generating the logging I need to track this down. I
have broken down the worker logs into one log per worker script...
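A minimal sketch of the one-log-per-worker idea, written in Python purely for illustration. The real worker script is not shown in this message, so the helper name worker_logger, the worker-<id>.log file naming, and the worker IDs below are all hypothetical:

```python
import logging
import os


def worker_logger(worker_id, log_dir):
    """Return a logger that writes only to log_dir/worker-<worker_id>.log.

    Hypothetical helper: the name and file layout are assumptions,
    not taken from the actual Coaster worker script.
    """
    logger = logging.getLogger(f"worker.{worker_id}")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep records out of any shared root log
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        path = os.path.join(log_dir, f"worker-{worker_id}.log")
        handler = logging.FileHandler(path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With one file per worker, the last lines written by a worker that died can be read in isolation instead of being interleaved with output from hundreds of other nodes.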
Paste:
Running on the Intrepid compute nodes. In the last few runs I've seen it
only in the 512-node case (which I think worked at least once), not with
256 nodes, but that could just be because the failure is rare.
2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB): handling reply timeout; sendReqTime=100618-160429.108, sendTime=100618-160429.108, now=100618-160629.117
2010-06-18 16:06:29,117-0500 INFO Command Command(2, SUBMITJOB): re-sending
2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
    at java.util.TimerThread.mainLoop(Timer.java:537)
    at java.util.TimerThread.run(Timer.java:487)
2010-06-18 16:06:29,118-0500 INFO Command Sending Command(2, SUBMITJOB) on MetaChannel: 855782146 -> SC-0618-370320-000000-001756
2010-06-18 16:06:29,119-0500 INFO AbstractStreamKarajanChannel Channel IOException
java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:137)
    at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:292)
    at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:244)
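For reference, the sendTime and now fields in the first WARN line of the log above give the interval between the original send and the timeout handler firing. The Python sketch below computes it, assuming those fields use a yymmdd-HHMMSS.mmm layout (an assumption read off the log, not confirmed against the Coaster source). The helper name elapsed_seconds is hypothetical:

```python
import re
from datetime import datetime

# Assumed layout of the sendTime / now fields: yymmdd-HHMMSS.mmm
STAMP_FORMAT = "%y%m%d-%H%M%S.%f"


def elapsed_seconds(log_line):
    """Return seconds between the sendTime and now fields of a timeout line."""
    send = re.search(r"\bsendTime=([\d.\-]+)", log_line)
    now = re.search(r"\bnow=([\d.\-]+)", log_line)
    if send is None or now is None:
        raise ValueError("line has no sendTime/now fields")
    t_send = datetime.strptime(send.group(1), STAMP_FORMAT)
    t_now = datetime.strptime(now.group(1), STAMP_FORMAT)
    return (t_now - t_send).total_seconds()


line = (
    "2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB): "
    "handling reply timeout; sendReqTime=100618-160429.108, "
    "sendTime=100618-160429.108, now=100618-160629.117"
)
print(f"{elapsed_seconds(line):.3f}")  # prints 120.009
```

Under that reading, the timeout fired almost exactly two minutes after the SUBMITJOB command was sent, and the re-send one millisecond later hit the broken pipe.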
--
Justin M Wozniak