[Swift-devel] Re: Coaster problem on BG/P - worker processes dying
Justin M Wozniak
wozniak at mcs.anl.gov
Thu Jul 1 13:20:16 CDT 2010
On Thu, 1 Jul 2010, Zhao Zhang wrote:
> Is there any chance that each worker is writing log files to GPFS, or
> writing to RAM and then copying to GPFS? Even in the latter case, we used
> dd instead of cp on ZeptoOS, because with dd we could set the block size,
> while cp uses a line buffer to dump data to GPFS, which is quite slow.
I have modified the perl script so that each worker now writes its own log
file directly to GPFS. (Speed is not an issue right now.)
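
Roughly along these lines (a sketch only -- the log directory, file naming,
and function name below are placeholders, not the actual worker code):

    # Sketch: one log file per worker, named by host and PID, opened
    # directly on a GPFS path instead of a shared file.
    use Sys::Hostname;

    my $LOGDIR  = "/gpfs/logs/workers";        # placeholder path
    my $LOGFILE = "$LOGDIR/worker-" . hostname() . "-$$.log";

    open(LOG, ">", $LOGFILE) or die "Cannot open $LOGFILE: $!";
    select((select(LOG), $| = 1)[0]);          # autoflush so entries survive a crash

    sub worker_log {
        my $msg = shift;
        print LOG scalar(localtime()), " $msg\n";
    }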
> Another suspect would be that we are overwhelming the IO nodes. As far
> as I remember, coaster runs as a service on each compute node with a TCP
> connection to the login node, and the communication between the login
> node and the compute nodes is handled by an IP forwarding component in
> ZeptoOS. In the tests I did before, the Falkon service was not stable
> with 1024 nodes each holding a TCP connection to the service. Can we log
> in to the IO nodes while we see those errors?
That seems like a possibility. If I can whittle the problem down to that
level, we will have something to report to the ZeptoOS team.
Thanks
> Justin M Wozniak wrote:
>> On Thu, 1 Jul 2010, Michael Wilde wrote:
>>
>>> Justin, can you send a brief update to the list on the coaster problem
>>> (workers exiting after a few jobs) that is blocking you on the BG/P, and
>>> how you are re-working worker logging to debug it?
>>
>> A paste from a previous email is below (both BG/P systems are down due to
>> cooling issues today).
>>
>> So far, the issue only appears after several thousand jobs run on at least
>> 512 nodes.
>>
>> I'm pretty close to generating the logging I need to track this down. I
>> have broken down the worker logs into one log per worker script...
>>
>> Paste:
>>
>> Running on the Intrepid compute nodes. In the last few runs I've only seen
>> it in the 512-node case (I think this worked at least once), not the 256-node
>> case, but that could just be because the failure is rare.
>>
>> 2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB): handling
>> reply timeout; sendReqTime=100618-160429.108, sendTime=100618-160429.108,
>> now=100618-160629.117
>> 2010-06-18 16:06:29,117-0500 INFO Command Command(2, SUBMITJOB):
>> re-sending
>> 2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB)fault was:
>> Reply timeout
>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
>> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
>> at java.util.TimerThread.mainLoop(Timer.java:537)
>> at java.util.TimerThread.run(Timer.java:487)
>> 2010-06-18 16:06:29,118-0500 INFO Command Sending Command(2, SUBMITJOB) on
>> MetaChannel: 855782146 -> SC-0618-370320-000000-001756
>> 2010-06-18 16:06:29,119-0500 INFO AbstractStreamKarajanChannel Channel
>> IOException
>> java.net.SocketException: Broken pipe
>> at java.net.SocketOutputStream.socketWrite0(Native Method)
>> at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105)
>> at java.net.SocketOutputStream.write(SocketOutputStream.java:137)
>> at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:292)
>> at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:244)
>>
>>
>
--
Justin M Wozniak