[Swift-devel] Coaster socket issue

Jonathan Monette jonmon at mcs.anl.gov
Wed Mar 28 21:25:34 CDT 2012


So on Beagle (where I have had a run going for about 1.5 days) I get a lot of output.  Running the command lsof -u jonmon | grep java | grep FIFO | wc -l
I get 116.

By running the command lsof -u jonmon | grep java | grep CLOSE_WAIT | wc -l I get 315.

But by going to /proc/<java.pid>/fd and running ls -l | grep socket | wc -l I get 2911.

I am not sure what numbers to believe.
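One way to cross-check all three counts in a single pass is to read them straight out of /proc and lsof for just that one process (a rough sketch; 21823 is the coaster JVM pid from the lsof output below, substitute whichever pid you have):

    pid=21823                                # the coaster service JVM
    ls /proc/$pid/fd | wc -l                 # total open fds for the process
    ls -l /proc/$pid/fd | grep -c socket     # fds that are sockets, in any TCP state
    lsof -p $pid | grep -c FIFO              # pipe fds, as lsof reports them
    lsof -p $pid | grep -c CLOSE_WAIT        # TCP connections lsof shows in CLOSE_WAIT

The numbers don't have to agree: the fd directory counts every socket inode regardless of TCP state, while the CLOSE_WAIT grep only matches connections currently in that state, so 2911 sockets with only 315 of them in CLOSE_WAIT isn't necessarily a contradiction. Also, lsof -u jonmon | grep java counts across every java process owned by jonmon, not just this one.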

The output from lsof -u jonmon | grep java | grep FIFO is 
java      21823 jonmon   99r     FIFO                0,6         0t0 259630056 pipe
java      21823 jonmon  100w     FIFO                0,6         0t0 259630056 pipe
java      21823 jonmon 2745w     FIFO                0,6         0t0 318767708 pipe
java      21823 jonmon 2747r     FIFO                0,6         0t0 318767709 pipe
java      21823 jonmon 2813r     FIFO                0,6         0t0 318779029 pipe
java      21823 jonmon 2830r     FIFO                0,6         0t0 318767710 pipe
java      21823 jonmon 2874r     FIFO                0,6         0t0 318779030 pipe
java      21823 jonmon 2909w     FIFO                0,6         0t0 318779031 pipe
java      21823 jonmon 2961r     FIFO                0,6         0t0 318484490 pipe
java      21823 jonmon 2964w     FIFO                0,6         0t0 318484491 pipe
java      21823 jonmon 2966r     FIFO                0,6         0t0 318484492 pipe
java      21823 jonmon 2989r     FIFO                0,6         0t0 318558560 pipe
java      21823 jonmon 2991r     FIFO                0,6         0t0 318632607 pipe
java      21823 jonmon 2993w     FIFO                0,6         0t0 318558561 pipe
java      21823 jonmon 2997r     FIFO                0,6         0t0 318558562 pipe
java      21823 jonmon 2999r     FIFO                0,6         0t0 318632608 pipe
java      21823 jonmon 3002r     FIFO                0,6         0t0 318632609 pipe

The count of these pipes seems to go up and down.  A couple of minutes ago it was at 116 (the number above) but now it is down to ~20, so the FIFO count is fluctuating.  My worry is the socket count and the number of sockets in the CLOSE_WAIT state.  Those seem to vastly outnumber the pipes, at least according to what is in the fd directory of the process.
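A rough sketch for watching whether the CLOSE_WAIT count keeps climbing (rather than oscillating the way the pipes do), sampling the same JVM every 30 seconds:

    pid=21823
    while true; do
        # print time, CLOSE_WAIT socket count, and total open fds
        echo "$(date +%H:%M:%S)  $(lsof -p $pid 2>/dev/null | grep -c CLOSE_WAIT)  $(ls /proc/$pid/fd 2>/dev/null | wc -l)"
        sleep 30
    done

If the CLOSE_WAIT column climbs steadily toward the ulimit -n value while the FIFO count just bounces around, that would point at the sockets, not the pipes, as what eventually trips the "too many open files" error.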

On Mar 28, 2012, at 9:11 PM, David Kelly wrote:

> The limit here seems to be 1024.
> 
> Just curious, what happens when you run 'lsof -u jonmon'? For me, I see lines like this that grow over time:
> 
> java    14589 dkelly  220r     FIFO                0,6           601514288 pipe
> java    14589 dkelly  221r     FIFO                0,6           601514581 pipe
> java    14589 dkelly  222w     FIFO                0,6           601514852 pipe
> java    14589 dkelly  223r     FIFO                0,6           601514582 pipe
> 
> 
> ----- Original Message -----
>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>> To: "David Kelly" <davidk at ci.uchicago.edu>
>> Cc: "swift-devel at ci.uchicago.edu Devel" <swift-devel at ci.uchicago.edu>
>> Sent: Wednesday, March 28, 2012 8:57:03 PM
>> Subject: Re: [Swift-devel] Coaster socket issue
>> What is the open files limit on that machine (ulimit -n)? I have
>> never witnessed this issue before, so it may only appear on machines
>> with relatively low open file limits (raven has 1K but beagle has
>> 60K). This is still something we should look into though.
>> 
>> On Mar 28, 2012, at 8:49 PM, David Kelly wrote:
>> 
>>> 
>>> Strange, I just ran into a similar issue tonight while running on
>>> ibicluster (SGE). I saw the "too many open files" error after
>>> sitting in the queue waiting for a job to start. I restarted the job
>>> and then periodically ran 'lsof' to see the number of java pipes
>>> increasing over time. I thought at first this might be SGE specific,
>>> but perhaps it is something else. (This was with 0.93)
>>> 
>>> ----- Original Message -----
>>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>> To: "swift-devel at ci.uchicago.edu Devel"
>>>> <swift-devel at ci.uchicago.edu>
>>>> Sent: Wednesday, March 28, 2012 8:30:52 PM
>>>> Subject: [Swift-devel] Coaster socket issue
>>>> Hello,
>>>> In running the SciColSim app on raven (which is a cluster similar
>>>> to Beagle) I noticed that the app hung. It was not hung in the
>>>> sense that the hang checker kicked in; rather, Swift was waiting
>>>> for jobs to become active but none had been submitted to PBS. I
>>>> took a look at the log file and noticed that a java.io.IOException
>>>> had been thrown for "too many open files". Since I killed that run
>>>> I couldn't probe it, but I had the same run going on Beagle. Upon
>>>> Mike's suggestion I took a look at the /proc/<pid>/fd directory.
>>>> There were over 2000 sockets in the CLOSE_WAIT state with a single
>>>> message in the receive queue. Raven has a limit of 1024 open files
>>>> at a time while Beagle has a limit of around 60K open files. I got
>>>> these limits using ulimit -n.
>>>> 
>>>> So my question is, why are there so many sockets waiting to be
>>>> closed? I did some reading about the CLOSE_WAIT state and it seems
>>>> this happens when one end closes its socket but the other does
>>>> not. Is Coaster not closing the socket when a worker shuts down?
>>>> What other information should I be looking for to help debug the
>>>> issue?
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel



