[Swift-devel] Coaster socket issue
Jonathan Monette
jonmon at mcs.anl.gov
Tue Apr 10 15:27:07 CDT 2012
Mihael,
So the fix for bug 762 did not resolve the socket issue. Over the weekend I ran a large-scale run and hit the same IOException for too many open file descriptors. When I checked /proc/<pid>/fd, there were 1017 open file descriptors. The limit on the machine is 1024, so 1017 is dangerously close to it. I am assuming that some fds were closed once the limit was reached.
Most of the fds in that directory were sockets. I then checked netstat -a and found several sockets in the CLOSE_WAIT state. They had the form:
tcp 1 0 nid00008:51313 nid00014:58012 CLOSE_WAIT
The run times for the PBS jobs were short (only about 30 minutes) while the Swift run had been going for over 12 hours. Even doing the math on that does not explain why ~900 of the fds in /proc were sockets.
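In case it is useful for comparing numbers, here is a small standalone sketch of the check I have been doing by hand (this is not Coaster code; the FdCount class name is just for illustration). It walks /proc/<pid>/fd and counts how many of the fds resolve to sockets:

    import java.io.File;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Count open fds for a pid and how many of them are sockets,
    // by resolving the symlinks in /proc/<pid>/fd.
    public class FdCount {
        public static void main(String[] args) throws Exception {
            String pid = args.length > 0 ? args[0] : "self";
            File[] fds = new File("/proc/" + pid + "/fd").listFiles();
            if (fds == null) {
                System.err.println("cannot read /proc/" + pid + "/fd");
                return;
            }
            int total = 0, sockets = 0;
            for (File fd : fds) {
                total++;
                try {
                    Path target = Files.readSymbolicLink(fd.toPath());
                    if (target.toString().startsWith("socket:")) {
                        sockets++;
                    }
                } catch (Exception e) {
                    // fd disappeared between listing and readlink; skip it
                }
            }
            System.out.println(total + " open fds, " + sockets + " are sockets");
        }
    }

I plan to run something like this against the coaster service pid every few minutes to see whether the socket count grows steadily or jumps when PBS jobs exit.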
Researching the CLOSE_WAIT state, I found several posts about it. They all agree that it is bad, but they give different reasons for why it shows up. One thing that all of the CLOSE_WAIT sockets reported by netstat have in common is that they have a message in the receive queue (the second column in the output I pasted above). My current theory is that the socket is waiting for that message to be read before actually closing. Do you think that is possible? I do not have any other evidence or data about this yet, but I will be gathering some very soon. If there is any specific data you would like to see, please let me know and I can gather it for you.
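To convince myself I understand the state itself, I put together a minimal standalone reproduction (again, not Coaster code; the CloseWaitDemo class name and port 50000 are arbitrary). The remote end sends one byte and closes, the local side never reads or closes, and netstat then shows the local socket in CLOSE_WAIT with 1 in Recv-Q, which is exactly the pattern above:

    import java.net.ServerSocket;
    import java.net.Socket;

    // Reproduce the netstat pattern above: the remote end sends one byte
    // and closes, the local end never reads or closes, and the local
    // socket sits in CLOSE_WAIT with 1 in Recv-Q until close() is called.
    public class CloseWaitDemo {
        public static void main(String[] args) throws Exception {
            ServerSocket listener = new ServerSocket(50000);

            // "Remote" end: connect, send one byte, close (sends FIN).
            Socket remote = new Socket("localhost", 50000);
            Socket local = listener.accept();
            remote.getOutputStream().write(1);
            remote.close();

            // 'local' is never read from and never closed.  While this
            // sleeps, run: netstat -tan | grep CLOSE_WAIT
            Thread.sleep(60000);

            local.close();   // only this (or process exit) frees the fd
            listener.close();
        }
    }

At least in this sketch the fd is only released once close() is called, whether or not the byte is ever read, so I also want to check whether the coaster service closes its end when a worker goes away.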
What are your thoughts on this issue?
On Mar 28, 2012, at 9:30 PM, Michael Wilde wrote:
> Does
> ls -l /proc/14598/fd
> tell you anything more?
>
> Sounds to me like Swift is trying to qstat a qsub'ed job. Perhaps there is some incompatibility between the SGE provider and the local SGE release? We've seen similar things with older (or newer) SGE releases. (I think you in fact diagnosed some of these issues, as I recall...)
>
> - Mike
>
> ----- Original Message -----
>> From: "David Kelly" <davidk at ci.uchicago.edu>
>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>> Cc: "swift-devel at ci.uchicago.edu Devel" <swift-devel at ci.uchicago.edu>
>> Sent: Wednesday, March 28, 2012 9:11:31 PM
>> Subject: Re: [Swift-devel] Coaster socket issue
>> The limit here seems to be 1024.
>>
>> Just curious, what happens when you run 'lsof -u jonmon'? For me, I
>> see lines like this that grow over time:
>>
>> java 14589 dkelly 220r FIFO 0,6 601514288 pipe
>> java 14589 dkelly 221r FIFO 0,6 601514581 pipe
>> java 14589 dkelly 222w FIFO 0,6 601514852 pipe
>> java 14589 dkelly 223r FIFO 0,6 601514582 pipe
>>
>>
>> ----- Original Message -----
>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>> To: "David Kelly" <davidk at ci.uchicago.edu>
>>> Cc: "swift-devel at ci.uchicago.edu Devel"
>>> <swift-devel at ci.uchicago.edu>
>>> Sent: Wednesday, March 28, 2012 8:57:03 PM
>>> Subject: Re: [Swift-devel] Coaster socket issue
>>> What is the open files limit on that machine (ulimit -n)? I have
>>> never witnessed this issue before, so it may only appear on machines
>>> with relatively low open file limits (Raven has 1K but Beagle has
>>> 60K). This is still something we should look into, though.
>>>
>>> On Mar 28, 2012, at 8:49 PM, David Kelly wrote:
>>>
>>>>
>>>> Strange, I just ran into a similar issue tonight while running on
>>>> ibicluster (SGE). I saw the "too many open files" error after
>>>> sitting in the queue waiting for a job to start. I restarted the
>>>> job
>>>> and then periodically ran 'lsof' to see the number of java pipes
>>>> increasing over time. I thought at first this might be SGE
>>>> specific,
>>>> but perhaps it is something else. (This was with 0.93)
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>> To: "swift-devel at ci.uchicago.edu Devel"
>>>>> <swift-devel at ci.uchicago.edu>
>>>>> Sent: Wednesday, March 28, 2012 8:30:52 PM
>>>>> Subject: [Swift-devel] Coaster socket issue
>>>>> Hello,
>>>>> While running the SciColSim app on Raven (a cluster similar to
>>>>> Beagle) I noticed that the app hung. It was not hung to the point
>>>>> where the hang checker kicked in; Swift was waiting for jobs to
>>>>> become active, but none had been submitted to PBS. I took a look
>>>>> at the log file and noticed that a java.io.IOException had been
>>>>> thrown for "too many open files". Since I had killed that run I
>>>>> could not probe it, but I had the same run going on Beagle. On
>>>>> Mike's suggestion I took a look at the /proc/<pid>/fd directory.
>>>>> There were over 2000 sockets in the CLOSE_WAIT state with a
>>>>> single message in the receive queue. Raven has a limit of 1024
>>>>> open files at a time, while Beagle's limit is around 60K open
>>>>> files. I got these limits using ulimit -n.
>>>>>
>>>>> So my question is, why are there so many sockets waiting to be
>>>>> closed? I did some reading about the CLOSE_WAIT state and it
>>>>> seems this happens when one end closes its socket but the other
>>>>> does not. Is Coaster not closing the socket when a worker shuts
>>>>> down? What other information should I be looking for to help
>>>>> debug the issue?
>>>>> _______________________________________________
>>>>> Swift-devel mailing list
>>>>> Swift-devel at ci.uchicago.edu
>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>