[Swift-devel] Re: Coaster problem on BG/P - worker processes dying
Justin M Wozniak
wozniak at mcs.anl.gov
Thu Jul 1 13:20:16 CDT 2010
On Thu, 1 Jul 2010, Zhao Zhang wrote:
> Is there any chance that each worker is writing log files to GPFS, or
> writing to RAM and then copying to GPFS? Even in the latter case, we used
> dd instead of cp on ZeptoOS, because with dd we could set the block size,
> while cp uses a line buffer to dump data to GPFS, which is quite slow.
I have modified the perl script so that each worker now writes its own log
file directly to GPFS. (Speed is not an issue right now.)
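
Roughly along these lines (a sketch only -- the log directory, file naming,
and function name below are placeholders, not the actual worker code):

    # Sketch: one log file per worker, named by host and PID, opened
    # directly on a GPFS path instead of a shared file.
    use Sys::Hostname;

    my $LOGDIR  = "/gpfs/logs/workers";        # placeholder path
    my $LOGFILE = "$LOGDIR/worker-" . hostname() . "-$$.log";

    open(LOG, ">", $LOGFILE) or die "Cannot open $LOGFILE: $!";
    select((select(LOG), $| = 1)[0]);          # autoflush so entries survive a crash

    sub worker_log {
        my $msg = shift;
        print LOG scalar(localtime()), " $msg\n";
    }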
> Another suspect would be that we are overwhelming the IO nodes. As far
> as I remember, coaster runs as a service on each compute node with a TCP
> connection to the login node, and the communication between the login
> node and the compute nodes is handled by an IP forwarding component in
> ZeptoOS. In the tests I did before, the Falkon service was not stable
> with 1024 nodes each holding a TCP connection to the service. Can we log
> in to the IO nodes while we see those errors?
That seems like a possibility. If I can whittle the problem down to that
level, we will have something to report to the ZeptoOS team.
Thanks
> Justin M Wozniak wrote:
>> On Thu, 1 Jul 2010, Michael Wilde wrote:
>>
>>> Justin, can you send a brief update to the list on the coaster problem
>>> (workers exiting after a few jobs) that is blocking you on the BG/P, and
>>> how you are re-working worker logging to debug it?
>>
>> A paste from a previous email is below (both BG/P systems are down due to
>> cooling issues today).
>>
>> So far, the issue only appears after several thousand jobs run on at least
>> 512 nodes.
>>
>> I'm pretty close to generating the logging I need to track this down. I
>> have broken down the worker logs into one log per worker script...
>>
>> Paste:
>>
>> Running on the Intrepid compute nodes. In the last few runs I've only seen
>> it in the 512-node case (I think this worked at least once), not the 256-node
>> case, but that could just be because the failure is rare.
>>
>> 2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB): handling
>> reply timeout; sendReqTime=100618-160429.108, sendTime=100618-160429.108,
>> now=100618-160629.117
>> 2010-06-18 16:06:29,117-0500 INFO Command Command(2, SUBMITJOB):
>> re-sending
>> 2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB)fault was:
>> Reply timeout
>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
>> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
>> at java.util.TimerThread.mainLoop(Timer.java:537)
>> at java.util.TimerThread.run(Timer.java:487)
>> 2010-06-18 16:06:29,118-0500 INFO Command Sending Command(2, SUBMITJOB) on
>> MetaChannel: 855782146 -> SC-0618-370320-000000-001756
>> 2010-06-18 16:06:29,119-0500 INFO AbstractStreamKarajanChannel Channel
>> IOException
>> java.net.SocketException: Broken pipe
>> at java.net.SocketOutputStream.socketWrite0(Native Method)
>> at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105)
>> at java.net.SocketOutputStream.write(SocketOutputStream.java:137)
>> at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:292)
>> at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:244)
>>
>>
>
--
Justin M Wozniak