[Swift-devel] Re: Coaster problem on BG/P - worker processes dying (fwd)

Justin M Wozniak wozniak at mcs.anl.gov
Thu Jul 1 13:22:27 CDT 2010


On Thu, 1 Jul 2010, Mihael Hategan wrote:

> On Thu, 2010-07-01 at 12:13 -0500, Zhao Zhang wrote:
>> Hi, Justin
>> 
>> Is there any chance that each worker is writing log files to GPFS, or
>> writing to RAM and then copying to GPFS?
>> Even in the latter case, we used dd instead of cp on ZeptoOS, because
>> with dd we could set the block size, while cp uses a line buffer to
>> dump data to GPFS, which is quite slow.
> 
> For some time now, the worker log level has been set to WARN (which
> only produces a message at the start and end) when the number of
> workers is >= 16.

Right, I have made changes there.
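
(For what it's worth, the gating Mihael describes boils down to
something like this sketch, in Python for illustration only; the actual
worker is not Python, and the names here are made up:)

    import logging

    num_workers = 64  # hypothetical; the worker learns this at startup

    # Full logging for small runs; WARN-only (a message at start and end)
    # once the worker count reaches 16, to avoid hammering GPFS with logs.
    level = logging.WARN if num_workers >= 16 else logging.DEBUG
    logging.basicConfig(level=level)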

>> Another suspect is that we are overwhelming the I/O nodes. As far as
>> I remember, coaster runs as a service on each compute node with a TCP
>> connection to the login node. The communication between the login
>> node and the CN is handled by an IP forwarding component in ZeptoOS.
>> In the tests I did before, the Falkon service was not stable with
>> 1024 nodes each holding a TCP connection to the service. Can we log
>> in to the I/O nodes while we see those errors?
> 
> Maybe, but I was able to run with 40k cores even before the logging
> scheme above was enabled. Since then, we have switched to a single TCP
> connection per node (regardless of core count) and to the much-reduced
> logging. So I suspect this isn't the problem, unless the ZOID NAT got
> messed up.
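
(A hedged sketch of that one-connection-per-node scheme, in Python for
illustration; the host, port, framing, and thread model are assumptions,
not the actual coaster protocol:)

    import socket
    import threading

    HOST, PORT = "login-node", 50000  # hypothetical service endpoint
    CORES = 4                         # BG/P compute nodes have 4 cores

    # One TCP connection for the whole node; each per-core worker thread
    # writes to it under a lock instead of opening its own connection, so
    # 1024 nodes present 1024 sockets to the service, not 4096.
    sock = socket.create_connection((HOST, PORT))
    lock = threading.Lock()

    def worker(core_id):
        msg = ("core %d: ready\n" % core_id).encode()
        with lock:  # serialize writes on the shared socket
            sock.sendall(msg)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(CORES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()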

-- 
Justin M Wozniak


