[Swift-devel] Coaster socket issue

Jonathan Monette jonmon at mcs.anl.gov
Wed Mar 28 21:26:17 CDT 2012


So it looked like the first command to fail after the IOException in the log was qstat. Swift also couldn't open any new wrapper files for the jobs.

On Mar 28, 2012, at 9:21 PM, Michael Wilde wrote:

> Now that I think about it, I suspect the pipes may be from Swift running various commands, like qsub/qstat from the localscheduler provider, and/or app() calls from the local execution provider. I don't know if we ever paid much attention to whether these were all getting cleaned up.
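
To illustrate the cleanup concern above: each child command started with captured output holds pipe descriptors in the parent until they are closed. The following is only a Python sketch of the pattern, not Swift's provider code (which is Java); run_and_release is a hypothetical helper name, and the Python interpreter stands in for a scheduler command such as qstat.

```python
import subprocess
import sys

def run_and_release(cmd):
    """Run a command, capture its output, and make sure its pipes get closed.

    Hypothetical helper for illustration only. Using Popen as a context
    manager closes the stdout/stderr pipes and waits for the child on exit
    from the block, so no pipe descriptors are left behind in the parent.
    """
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE) as proc:
        out, err = proc.communicate()  # reads both pipes to EOF
    return proc.returncode, out, err

if __name__ == "__main__":
    # Stand-in for a scheduler command; any short-lived command works here.
    rc, out, err = run_and_release([sys.executable, "-c", "print('ok')"])
    print(rc, out)
```

If the pipes were opened but never closed on some code path, a long run that polls the scheduler repeatedly would accumulate them, which would be consistent with the pipe count growing over time.
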
> 
> - Mike
> 
> ----- Original Message -----
>> From: "Michael Wilde" <wilde at mcs.anl.gov>
>> To: "David Kelly" <davidk at ci.uchicago.edu>
>> Cc: "swift-devel at ci.uchicago.edu Devel" <swift-devel at ci.uchicago.edu>
>> Sent: Wednesday, March 28, 2012 9:10:38 PM
>> Subject: Re: [Swift-devel] Coaster socket issue
>> I think that on Jon's Beagle runs we saw about 100 pipes but several
>> thousand sockets, so we didn't pay any attention to the pipes (yet).
>> 
>> The sockets were clearly from workers to the coaster service.
>> 
>> I have no idea yet what the pipes are. ls -l of /proc/<pid>/fd/ does a nice
>> job of trying to identify and format the file name or object
>> associated with each file descriptor. I suspect it's doing the same
>> thing lsof does.
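
The /proc listing described above can also be tallied with a script rather than read by eye. This is a minimal Linux-only sketch; classify_fds and kind_of are hypothetical names, and it relies on the kernel showing sockets and pipes as symlink targets of the form "socket:[inode]" and "pipe:[inode]".

```python
import os
from collections import Counter

def kind_of(target):
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor kind."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"

def classify_fds(pid="self"):
    """Count the open descriptors of a process by kind (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        counts[kind_of(target)] += 1
    return counts

if __name__ == "__main__":
    # Inspect this Python process; pass the JVM's pid to look at Swift.
    print(dict(classify_fds()))
```

Running it periodically against the same pid would show whether the socket or pipe count is the one that keeps growing.
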
>> 
>> - Mike
>> 
>> ----- Original Message -----
>>> From: "David Kelly" <davidk at ci.uchicago.edu>
>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>> Cc: "swift-devel at ci.uchicago.edu Devel"
>>> <swift-devel at ci.uchicago.edu>
>>> Sent: Wednesday, March 28, 2012 8:49:21 PM
>>> Subject: Re: [Swift-devel] Coaster socket issue
>>> Strange, I just ran into a similar issue tonight while running on
>>> ibicluster (SGE). I saw the "too many open files" error after sitting
>>> in the queue waiting for a job to start. I restarted the job and then
>>> periodically ran 'lsof' to see the number of java pipes increasing
>>> over time. I thought at first this might be SGE specific, but perhaps
>>> it is something else. (This was with 0.93)
>>> 
>>> ----- Original Message -----
>>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>> To: "swift-devel at ci.uchicago.edu Devel"
>>>> <swift-devel at ci.uchicago.edu>
>>>> Sent: Wednesday, March 28, 2012 8:30:52 PM
>>>> Subject: [Swift-devel] Coaster socket issue
>>>> Hello,
>>>> In running the SciColSim app on raven (which is a cluster similar to
>>>> Beagle) I noticed that the app hung. It was not hung where the hang
>>>> checker kicked in, but Swift was waiting for jobs to be active while
>>>> there were none submitted to PBS. I took a look at the log file and
>>>> noticed that I had a java.io.IOException thrown for "too many open
>>>> files". Since I killed it I couldn't probe the run, but I had the same
>>>> run running on Beagle. Upon Mike's suggestion I took a look at the
>>>> /proc/<pid>/fd directory. There were over 2000 sockets in the
>>>> CLOSE_WAIT state with a single message in the receive queue. Raven has
>>>> a limit of 1024 open files at a time while Beagle has a limit of around
>>>> 60K open files. I got these limits using ulimit -n.
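
The numbers above explain why the same run fails on one machine and not the other: over 2000 lingering sockets exceed a 1024 limit but fit easily under roughly 60K. As a sketch, the limit and current usage can be compared from inside a process. This uses Python's resource module and reports on the Python process itself, not on the Swift JVM; fd_headroom is a hypothetical name.

```python
import os
import resource

def fd_headroom():
    """Return (soft_limit, hard_limit, fds_in_use) for the current process.

    The soft limit is the value that a plain `ulimit -n` reports in the shell.
    fds_in_use is None where /proc/self/fd is unavailable (non-Linux systems).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        in_use = len(os.listdir("/proc/self/fd"))
    except OSError:
        in_use = None
    return soft, hard, in_use

if __name__ == "__main__":
    soft, hard, in_use = fd_headroom()
    print(f"soft={soft} hard={hard} in_use={in_use}")
```
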
>>>> 
>>>> So my question is, why are there so many sockets waiting to be closed?
>>>> I did some reading about the CLOSE_WAIT state and it seems this
>>>> happens when one end closes its socket but the other does
>>>> not. Is Coaster not closing the socket when a worker shuts down? What
>>>> other information should I be looking for to help debug the issue?
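
The CLOSE_WAIT behaviour described above can be reproduced in a few lines. This is a Python sketch of the general TCP pattern only, and makes no claim about how the Coaster service (which is Java) handles its connections; serve_one and demo are hypothetical names. Once the peer closes, the local socket sits in CLOSE_WAIT, with any unread data still in its receive queue, until the local side reads to end-of-stream and closes its own descriptor.

```python
import socket

def serve_one(conn):
    """Drain a connection until the peer closes, then close our end.

    recv() returning b"" means the peer has closed its side; from that point
    our socket stays in CLOSE_WAIT (and keeps its descriptor) until we close.
    """
    received = bytearray()
    with conn:  # closes conn on exit, even if recv raises
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            received += chunk
    return bytes(received)

def demo():
    """Loopback demonstration: the client closes first, the server cleans up."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()
    client.sendall(b"bye")
    client.close()          # peer closes: conn is now in CLOSE_WAIT
    data = serve_one(conn)  # reads the queued message, sees EOF, closes conn
    listener.close()
    return data, conn.fileno()

if __name__ == "__main__":
    print(demo())
```

Here demo() should return (b'bye', -1), the -1 showing that the descriptor has been released. A service that never reads to end-of-stream, or never closes after seeing it, would leave one descriptor per departed peer in CLOSE_WAIT.
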
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>> 
>> --
>> Michael Wilde
>> Computation Institute, University of Chicago
>> Mathematics and Computer Science Division
>> Argonne National Laboratory
>> 
> 



