[Swift-devel] Coaster socket issue
Jonathan Monette
jonmon at mcs.anl.gov
Wed Mar 28 20:30:52 CDT 2012
Hello,
While running the SciColSim app on Raven (a cluster similar to Beagle), I noticed that the app hung. It was not the kind of hang that triggers the hang checker; Swift was waiting for jobs to become active, but none had been submitted to PBS. I looked at the log file and found a java.io.IOException thrown for "too many open files". Because I had killed that run I couldn't probe it, but I had the same run going on Beagle. At Mike's suggestion I looked at the /proc/<pid>/fd directory there. There were over 2000 sockets in the CLOSE_WAIT state with a single message in the receive queue. Raven has a limit of 1024 open files per process, while Beagle has a limit of around 60K. I got these limits using ulimit -n.
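For anyone who wants to repeat the check, here is a minimal Python sketch of how the count could be taken on Linux. It assumes the standard /proc/net/tcp layout (state in the fourth column as a hex code, 08 meaning CLOSE_WAIT, and the socket inode in the tenth column) and matches those inodes against the "socket:[inode]" links in /proc/<pid>/fd. The function names are made up for illustration and are not part of Swift or Coaster.

```python
import os
import re

CLOSE_WAIT = "08"  # hex TCP state code the Linux kernel prints in /proc/net/tcp


def close_wait_inodes(proc_net_tcp_text):
    """Return the inodes of all CLOSE_WAIT sockets listed in /proc/net/tcp text."""
    inodes = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # fields[3] is the state, fields[9] is the socket inode
        if len(fields) >= 10 and fields[3] == CLOSE_WAIT:
            inodes.add(fields[9])
    return inodes


def socket_inodes_of(pid):
    """Return the inodes of the sockets a process holds open, via /proc/<pid>/fd."""
    inodes = set()
    fd_dir = "/proc/%d/fd" % pid
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were scanning
        match = re.match(r"socket:\[(\d+)\]$", target)
        if match:
            inodes.add(match.group(1))
    return inodes


def count_close_wait(pid):
    """Count the CLOSE_WAIT sockets owned by pid (IPv4 table only)."""
    with open("/proc/net/tcp") as f:
        stuck = close_wait_inodes(f.read())
    return len(stuck & socket_inodes_of(pid))
```

Calling count_close_wait(pid) with the pid of the JVM gives the number to compare against ulimit -n. A JVM may open its sockets as IPv6, in which case the same parsing would have to be applied to /proc/net/tcp6 as well.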
So my question is: why are there so many sockets waiting to be closed? I did some reading about the CLOSE_WAIT state, and it seems this happens when one end closes its socket but the other does not. Is Coaster not closing the socket when a worker shuts down? What other information should I look for to help debug the issue?
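To make the CLOSE_WAIT behavior concrete, here is a small generic Python sketch. It is not Coaster's code (the Coaster service is Java) and only illustrates the pattern under discussion. A client connects over loopback, sends one message, and closes. The accepting side is then in CLOSE_WAIT until it calls close() itself, and it learns that the peer has gone away only when a read returns an empty result. The names serve_one and demo are hypothetical.

```python
import socket


def serve_one(listener):
    """Accept one connection, drain it, and close it once the peer hangs up."""
    conn, _ = listener.accept()
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                # Empty read: the peer has sent FIN. From here until we call
                # close(), the kernel reports this socket as CLOSE_WAIT.
                break
    finally:
        # Omitting this close() leaves the descriptor open in CLOSE_WAIT
        # and counts against the ulimit -n limit.
        conn.close()
    return conn


def demo():
    """Run one client/server exchange on loopback; return the closed socket's fileno."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    client.sendall(b"bye")
    client.close()  # the "worker" side shuts down first

    conn = serve_one(listener)
    listener.close()
    return conn.fileno()  # -1 once the socket has been closed
```

If the receiving side never reads the pending data or never reacts to the empty read, each departed peer would leave one descriptor behind, which would be consistent with sockets sitting in CLOSE_WAIT with unread data in the receive queue.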