[Swift-devel] coaster io with NIO.

David Kelly davidk at ci.uchicago.edu
Tue Apr 10 23:33:20 CDT 2012


Since the latest update, which fixes coaster-service, I have tested two configurations:

1 machine only, 4 jobs per node, 100 files of 200 MB each (ran twice, passed twice)
2 MCS machines: swift and coaster-service running on one machine, 1 worker, 4 jobs per node, 500 files of 20 MB each (also ran twice, passed twice)

These tests were failing pretty consistently yesterday. I am not positive it is completely fixed yet, but things have definitely improved.

I have never been able to reproduce the provider staging problems with jobs per node set to 1. It was only when I got to a value of 4 that I started seeing issues.
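
For illustration only, here is a toy model of the deadlock Mihael describes in the quoted message below. It is not the coaster code: `run`, `capacity`, and the chunk counts are made-up names and numbers. It assumes that each side of the connection is a single loop that cannot read while one of its writes is blocked, and that with one job per node a stage-in and a stage-out do not overlap on the connection.

```python
def run(capacity, stagein_chunks, stageout_chunks, read_while_blocked=False):
    """Toy model of one connection shared by a stage-in and a stage-out.

    capacity: chunks that fit "in flight" in each direction (hypothetical).
    Returns True if both transfers finish, False if neither side can move.
    """
    to_worker = 0    # chunks in flight, service -> worker (stage-in data)
    to_service = 0   # chunks in flight, worker -> service (stage-out data)
    left_in = stagein_chunks    # stage-in chunks the service still has to write
    left_out = stageout_chunks  # stage-out chunks the worker still has to write
    got_in = 0       # stage-in chunks read by the worker
    got_out = 0      # stage-out chunks read by the service

    while got_in < stagein_chunks or got_out < stageout_chunks:
        progressed = False

        # Service side: prefers writing; reads only when it has nothing left
        # to write, or when it is allowed to read while a write is blocked.
        if left_in and to_worker < capacity:
            to_worker += 1
            left_in -= 1
            progressed = True
        elif (left_in == 0 or read_while_blocked) and to_service:
            to_service -= 1
            got_out += 1
            progressed = True

        # Worker side: same policy in the opposite direction.
        if left_out and to_service < capacity:
            to_service += 1
            left_out -= 1
            progressed = True
        elif (left_out == 0 or read_while_blocked) and to_worker:
            to_worker -= 1
            got_in += 1
            progressed = True

        if not progressed:
            # Both directions are full and both sides are stuck in a write.
            return False
    return True
```

In this model, `run(4, 10, 10)` returns False (both directions fill up and both sides are stuck writing), while `run(4, 10, 0)` and `run(4, 0, 10)` return True, which would be consistent with never seeing the problem at jobs per node 1. Letting a side read while its write is blocked, `run(4, 10, 10, read_while_blocked=True)`, also returns True.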

I will write a test tonight that runs on OSG and let you know what happens.

David


----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Tuesday, April 10, 2012 10:51:05 PM
> Subject: Re: [Swift-devel] coaster io with NIO.
> Thanks, Ketan. David, can you try to reproduce the problem with
> jobsPerNode=1?
> 
> - Mike
> 
> ----- Original Message -----
> > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > Sent: Tuesday, April 10, 2012 9:31:34 PM
> > Subject: Re: [Swift-devel] coaster io with NIO.
> > The jobsPerNode settings were indeed 1 for the tests done on OSG.
> >
> >
> > I do not recall seeing the blocking messages seen in David's
> > current/recent tests.
> >
> >
> > On Tuesday, April 10, 2012, Michael Wilde wrote:
> >
> >
> > Mihael, while the scenario below seems plausible, I thought that the
> > timeout problem was first detected on OSG nodes, which should have
> > been running with jobsPerNode=1.
> >
> > David, Ketan, can you comment on the jobsPerNode settings for the many
> > tests you have done that encountered this problem?
> >
> > - Mike
> >
> > ----- Original Message -----
> > > From: "Mihael Hategan" < hategan at mcs.anl.gov >
> > > To: "David Kelly" < davidk at ci.uchicago.edu >
> > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > Sent: Tuesday, April 10, 2012 7:04:56 PM
> > > Subject: Re: [Swift-devel] coaster io with NIO.
> > > On Tue, 2012-04-10 at 17:25 -0500, David Kelly wrote:
> > > > > Yep, I gave it a try with automatic coasters, but am still seeing
> > > > > the timeouts.
> > > >
> > >
> > > > I think I see the problem. With multiple jobs per worker, it may
> > > > happen that both a stagein and a stageout run at the same time (on
> > > > the same TCP connection). If the stageout runs out of buffers, the
> > > > write to the socket on the worker side blocks, so the read loop
> > > > stops running. This eventually fills the other direction of the TCP
> > > > link and everything deadlocks.
> > >
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> >
> > --
> > Ketan
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 