[Swift-devel] Re: Coaster problem on BG/P - worker processes dying

Thu Jul 1 12:18:31 CDT 2010

On Thu, 2010-07-01 at 12:13 -0500, Zhao Zhang wrote:
> Hi, Justin
> 
> Is there any chance that each worker is writing log files to GPFS, or
> writing to RAM and then copying to GPFS?
> Even in the latter case, we used dd instead of cp on ZeptoOS, because with
> dd we could set the block size, while cp
> uses a lined buffer to dump data to GPFS, which is quite slow.
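
To illustrate the block-size point in the quoted text, here is a minimal sketch of a dd-style copy with an explicit block size. It is not the actual ZeptoOS or coaster worker code; the function name `copy_with_block_size` and the 1 MiB default are assumptions chosen only for illustration.

```python
def copy_with_block_size(src, dst, block_size=1024 * 1024):
    """Copy src to dst in fixed-size blocks, roughly like `dd bs=...`.

    Hypothetical helper: the name and the 1 MiB default are illustrative.
    Returns the number of bytes copied.
    """
    copied = 0
    # buffering=0 gives raw file objects, so each read/write call maps to
    # one request of at most block_size bytes instead of Python's own
    # default buffer size.
    with open(src, "rb", buffering=0) as fin, \
         open(dst, "wb", buffering=0) as fout:
        while True:
            block = fin.read(block_size)
            if not block:
                break  # end of file
            # A raw write may be partial, so loop until the block is out.
            view = memoryview(block)
            while len(view) > 0:
                written = fout.write(view)
                view = view[written:]
            copied += len(block)
    return copied
```

The useful block size depends on the file system, so it would have to be measured rather than assumed.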

For some time now, the worker log level has been set to WARN (which only
produces a message at the start and end) when the number of workers is >= 16.
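
The rule above can be sketched as follows. This is only an illustration, not the actual worker code: the function name `worker_log_level` is hypothetical, and the level used below the threshold (`DEBUG` here) is an assumption, since only the WARN case is stated.

```python
def worker_log_level(num_workers, threshold=16, verbose_level="DEBUG"):
    """Pick a worker log level from the number of workers.

    Hypothetical helper. Only the WARN case (num_workers >= 16) comes
    from the message above; verbose_level is an assumed placeholder for
    whatever level is used with fewer workers.
    """
    if num_workers >= threshold:
        return "WARN"
    return verbose_level
```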

> 
> Another suspect would be that we are overwhelming the IO nodes. As far
> as I remember, coaster is running as a
> service on each compute node with a TCP connection to the Login Node.
> The communication between the Login Node
> and the CNs is handled by an IP forwarding component in ZeptoOS. In the
> tests I did before, the Falkon service was not
> stable with 1024 nodes connecting to the service, each with a TCP
> connection. Can we log in to the IO nodes while we
> see those errors?

Maybe, but I was able to run with 40k cores when the logging scheme
above wasn't enabled. Since then, there has been a switch to only one
TCP connection per node (regardless of the number of cores) and to much
reduced logging. So I suspect this isn't the problem, unless the ZOID NAT
got messed up.