[Swift-devel] Re: Coaster problem on BG/P - worker processes dying
Mihael Hategan
hategan at mcs.anl.gov
Thu Jul 1 12:18:31 CDT 2010
On Thu, 2010-07-01 at 12:13 -0500, Zhao Zhang wrote:
> Hi, Justin
>
> Is there any chance that each worker is writing log files to GPFS or
> writing to RAM, then copying to GPFS?
> Even in the latter case, we used dd instead of cp on ZeptoOS, because
> with dd we could set the block size, while cp
> uses a line buffer to dump data to GPFS, which is quite slow.
For some time now, the worker log level has been set to WARN (which only
produces a message at the start and end) when the number of workers is
>= 16.
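As a minimal sketch of the rule just described (not the actual worker
code), the selection could look like the following. The function name,
the constant name, and the DEBUG default for small runs are assumptions
made for illustration; only the WARN level and the >= 16 threshold come
from the message above.

```python
# Hypothetical sketch: names are illustrative, not taken from the coaster worker.
WARN_THRESHOLD = 16  # from the message: WARN applies when workers >= 16


def worker_log_level(num_workers: int, default_level: str = "DEBUG") -> str:
    """Pick a worker log level from the number of workers.

    default_level is an assumed value for small runs; the message only
    states what happens at or above the threshold.
    """
    if num_workers >= WARN_THRESHOLD:
        return "WARN"
    return default_level
```

For example, worker_log_level(15) keeps the assumed default, while
worker_log_level(16) and anything larger returns "WARN".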
>
> Another suspect would be that we are overwhelming the IO nodes. As far
> as I remember, coaster is running as a
> service on each compute node with a TCP connection to the Login Node.
> The communication between Login Node
> and the CNs is handled by an IP forwarding component in ZeptoOS. In the
> tests I did before, the Falkon service was not
> stable with 1024 nodes connecting to the service, each with a TCP
> connection. Can we log in to the IO nodes while we
> see those errors?
Maybe, but I was able to run with 40k cores before the logging scheme
above was enabled. Since then, there has been a switch to only one TCP
connection per node (regardless of cores) and much reduced logging. So I
suspect this isn't the problem unless the ZOID NAT got messed up.
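
To put rough numbers on the connection argument, here is a small sketch
of the arithmetic. It assumes 4 cores per BG/P compute node, which is
not stated in this thread, and that the earlier scheme opened one
connection per core; the function names are made up for illustration.

```python
# Hypothetical sketch of the connection-count arithmetic.
CORES_PER_NODE = 4  # assumption: quad-core BG/P compute nodes


def connections_one_per_core(cores: int) -> int:
    """Assumed earlier scheme: every core holds its own TCP connection."""
    return cores


def connections_one_per_node(cores: int, cores_per_node: int = CORES_PER_NODE) -> int:
    """Current scheme: one TCP connection per node, regardless of cores."""
    # Ceiling division, so a partially used node still counts as one connection.
    return -(-cores // cores_per_node)
```

Under these assumptions, a 40k-core run needs 10,000 connections with
one connection per node, versus 40,000 with one per core.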