[Swift-devel] Re: Coaster problem on BG/P - worker processes dying

Zhao Zhang zhaozhang at uchicago.edu
Thu Jul 1 12:13:02 CDT 2010


Hi, Justin

Is there any chance that each worker is writing log files directly to
GPFS, or writing to RAM and then copying to GPFS?
Even in the latter case, we used dd instead of cp on ZeptoOS, because
with dd we could set the block size, while cp uses a line buffer to
dump data to GPFS, which is quite slow.
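
For what it's worth, the point of dd there is just to push the data in
large fixed-size blocks instead of line by line. A rough Java sketch of
the same idea; the 4 MB block size and the paths are made up for
illustration, I don't remember the exact block size we used:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Copy a worker log from a RAM disk to GPFS in large fixed-size blocks,
// like dd with an explicit bs=, instead of line-buffered writes.
public class BlockCopy {
    public static void main(String[] args) throws IOException {
        int blockSize = 4 * 1024 * 1024;  // hypothetical 4 MB block size
        try (FileInputStream in = new FileInputStream("/dev/shm/worker.log");
             FileOutputStream out = new FileOutputStream("/gpfs/logs/worker.log")) {
            byte[] block = new byte[blockSize];
            int n;
            while ((n = in.read(block)) != -1) {
                out.write(block, 0, n);  // one large write per block
            }
        }
    }
}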

Another suspect is that we are overwhelming the I/O nodes. As far as I
remember, coaster runs as a service on each compute node, with a TCP
connection to the login node, and the communication between the login
node and the compute nodes goes through an IP forwarding component in
ZeptoOS. In the tests I did before, the Falkon service was not stable
with 1024 nodes each holding a TCP connection to the service. Can we
log in to the I/O nodes while we see those errors?
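
I don't know exactly how the coaster service multiplexes its channels,
but if it is anywhere near the plain thread-per-connection pattern
below, 1024 persistent sockets is a lot to hold open through one
forwarding path. This is only a sketch of that pattern, not the coaster
code; the port number is made up:

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Thread-per-connection service: each compute node holds one persistent
// TCP connection for the life of the run, so 1024 nodes mean 1024 open
// sockets and 1024 reader threads on the service side.
public class PerNodeConnectionService {
    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(50000);  // hypothetical port
        while (true) {
            final Socket nodeConn = server.accept();    // one per compute node
            new Thread(new Runnable() {
                public void run() {
                    byte[] buf = new byte[8192];
                    try {
                        // Blocks here for the whole run, reading requests.
                        while (nodeConn.getInputStream().read(buf) >= 0) {
                            // dispatch the request ...
                        }
                    } catch (IOException e) {
                        // A dead worker first shows up here as a read or
                        // write error on its socket.
                    }
                }
            }).start();
        }
    }
}

If the errors correlate with socket buffer or thread pressure on the
I/O node side, that would point this way.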

Anyway, I can't say anything definitive right now.
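
One more thought: the trace quoted below looks like the usual
reply-timeout shape. A Timer fires handleReplyTimeout two minutes after
the send, the command is re-sent, and the re-send hits a socket whose
peer is already gone, hence the broken pipe. A rough sketch of that
shape; this is my guess at the pattern, not the actual CoG code:

import java.io.IOException;
import java.io.OutputStream;
import java.util.Timer;
import java.util.TimerTask;

// Sketch of a request/reply command with a reply timeout and one
// re-send, the shape suggested by the Command/Timer frames in the
// trace below. Names and details are invented.
public class TimedCommand {
    // The log shows the timeout firing two minutes after sendTime.
    private static final long REPLY_TIMEOUT_MS = 120 * 1000;

    private final Timer timer = new Timer(true);
    private final OutputStream channel;
    private volatile boolean replied = false;
    private volatile boolean resent = false;

    public TimedCommand(OutputStream channel) {
        this.channel = channel;
    }

    public void send(final byte[] payload) throws IOException {
        channel.write(payload);
        timer.schedule(new TimerTask() {
            public void run() {
                handleReplyTimeout(payload);
            }
        }, REPLY_TIMEOUT_MS);
    }

    void handleReplyTimeout(final byte[] payload) {
        if (replied) {
            return;
        }
        if (!resent) {
            resent = true;
            try {
                // If the worker died, this re-send is the write that
                // throws java.net.SocketException: Broken pipe.
                channel.write(payload);
                // Arm the timer once more for the re-sent copy.
                timer.schedule(new TimerTask() {
                    public void run() {
                        handleReplyTimeout(payload);
                    }
                }, REPLY_TIMEOUT_MS);
            } catch (IOException e) {
                fail(e);
            }
        } else {
            fail(new IOException("Reply timeout"));
        }
    }

    public void replyReceived() {
        replied = true;
    }

    private void fail(Exception e) {
        // Report the fault to whoever submitted the command.
    }
}

If that reading is right, the broken pipe is a symptom: the worker was
already dead two minutes earlier, and the re-send just discovers it.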

best
zhao

Justin M Wozniak wrote:
> On Thu, 1 Jul 2010, Michael Wilde wrote:
>
>> Justin, can you send a brief update to the list on the coaster 
>> problem (workers exiting after a few jobs) that is blocking you on 
>> the BG/P, and how you are re-working worker logging to debug it?
>
> A paste from a previous email is below (both BG/P systems are down due 
> to cooling issues today).
>
> So far, the issue only appears after several thousand jobs run on at 
> least 512 nodes.
>
> I'm pretty close to generating the logging I need to track this down.  
> I have broken down the worker logs into one log per worker script...
>
> Paste:
>
> Running on the Intrepid compute nodes.  In the last few runs I've only
> seen it in the 512-node case (I think this worked at least once), not
> at 256 nodes, but that could just be because the problem is rare.
>
> 2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB): handling reply timeout; sendReqTime=100618-160429.108, sendTime=100618-160429.108, now=100618-160629.117
> 2010-06-18 16:06:29,117-0500 INFO  Command Command(2, SUBMITJOB): re-sending
> 2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB) fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>         at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
>         at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
>         at java.util.TimerThread.mainLoop(Timer.java:537)
>         at java.util.TimerThread.run(Timer.java:487)
> 2010-06-18 16:06:29,118-0500 INFO  Command Sending Command(2, SUBMITJOB) on MetaChannel: 855782146 -> SC-0618-370320-000000-001756
> 2010-06-18 16:06:29,119-0500 INFO  AbstractStreamKarajanChannel Channel IOException
> java.net.SocketException: Broken pipe
>         at java.net.SocketOutputStream.socketWrite0(Native Method)
>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105)
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:137)
>         at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:292)
>         at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:244)
>
>


