[Swift-devel] persistent coasters on OSG

Michael Wilde wilde at mcs.anl.gov
Tue Aug 23 14:34:17 CDT 2011


OK, thanks.  I pointed Ketan to the wrapper script that launches the workers (and that runs as a Condor-G job). This script turns logging on and tries to send the log back on stdout or stderr.  That needs to be tested, as it's tricky to get the log to come back when the jobs are killed, and it's hard to get the logs from OSG.

Maybe we can do a test run where the coaster logs are copied or tee'd to a file on a shared filesystem whenever a signal arrives, and/or at periodic intervals.
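
A rough sketch of that idea follows, written in Python rather than as a change to the actual wrapper script (which is not shown in this thread). Everything named here is hypothetical: `run_with_log_rescue`, `copy_log`, the log paths, and the 60-second default interval are assumptions for illustration, not part of Swift or the coaster worker. The child process writes straight to a local log file; the wrapper copies that file to the shared location periodically, on SIGTERM/SIGINT, and once more at exit.

```python
import shutil
import signal
import subprocess
import threading
from pathlib import Path


def copy_log(local_log: Path, shared_log: Path) -> bool:
    """Copy the local log to the shared location; return True on success."""
    try:
        shared_log.parent.mkdir(parents=True, exist_ok=True)
        tmp = shared_log.with_name(shared_log.name + ".tmp")
        shutil.copyfile(local_log, tmp)
        tmp.replace(shared_log)  # rename so readers never see a half-written copy
        return True
    except OSError:
        return False  # a failed copy must not take the worker down


def run_with_log_rescue(cmd, local_log, shared_log, interval=60.0):
    """Run cmd, logging to local_log and mirroring it to shared_log.

    Hypothetical helper: cmd stands in for the real worker command line.
    """
    local_log, shared_log = Path(local_log), Path(shared_log)
    stop = threading.Event()

    def periodic():
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval):
            copy_log(local_log, shared_log)

    with open(local_log, "wb") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)

        def on_signal(signum, frame):
            # Save what we have first, then pass the shutdown on to the child.
            copy_log(local_log, shared_log)
            proc.terminate()

        old = {}
        # Python only allows installing signal handlers from the main thread.
        if threading.current_thread() is threading.main_thread():
            for sig in (signal.SIGTERM, signal.SIGINT):
                old[sig] = signal.signal(sig, on_signal)

        copier = threading.Thread(target=periodic, daemon=True)
        copier.start()
        try:
            rc = proc.wait()
        finally:
            stop.set()
            copier.join()
            for sig, handler in old.items():
                signal.signal(sig, handler)

    copy_log(local_log, shared_log)  # final copy after the child has exited
    return rc
```

SIGKILL cannot be caught, so if the batch system kills the job that way only the periodic copy helps; that is the reason the sketch does both. Whether a given OSG site exposes a shared filesystem to the worker nodes at all is a separate question that would have to be checked per site.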

- Mike


----- Original Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>, "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> Sent: Tuesday, August 23, 2011 2:27:54 PM
> Subject: Re: [Swift-devel] persistent coasters on OSG
> If you look through the service log, you see that all "lost connection
> to worker" messages come from workers on nemo. That implies that
> something is wrong there, but I can't tell what it is.
> 
> Perhaps enabling worker logging for workers on nemo might shed some
> light on the issue.
> 
> On Tue, 2011-08-23 at 14:17 -0500, Michael Wilde wrote:
> > Can you describe what you are seeing on Nemo and what to look for
> > there?
> >
> > - Mike
> >
> > ----- Original Message -----
> > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > Sent: Tuesday, August 23, 2011 2:15:03 PM
> > > Subject: Re: [Swift-devel] persistent coasters on OSG
> > > It looks like workers on nemo are somehow messed up. Can you find
> > > out why?
> > >
> > > On Mon, 2011-08-22 at 13:45 -0500, Ketan Maheshwari wrote:
> > > > Hi Mihael, All,
> > > >
> > > >
> > > > I am trying to test the persistent coasters setup with OSG sites
> > > > from communicado and see some intermittent exceptions and "jobs
> > > > failed" errors; the jobs eventually succeed on retries.
> > > >
> > > >
> > > > The exceptions I see in the log are mostly low-level network
> > > > exceptions (Channel Exceptions, Broken Pipe SocketExceptions,
> > > > Timeout, etc.).
> > > >
> > > >
> > > > The runs that I tried were incremental catsn runs with n=1, 10,
> > > > 50, and 100 and data.txt=100MB and 200MB.
> > > >
> > > >
> > > > The only runs that had the above-mentioned errors were the ones
> > > > with n=100 and data.txt=200MB.
> > > >
> > > >
> > > > The other runs completed without any errors.
> > > >
> > > >
> > > > I used just one OSG site for these runs.
> > > >
> > > >
> > > > Attaching the sites, log files and a file that contains
> > > > exception messages grepped from log files.
> > > >
> > > >
> > > > Any clues as to how to harden this? I had about 5 errors on
> > > > today's run and about 11 on a similar run last week.
> > > >
> > > >
> > > >
> > > >
> > > > Regards,
> > > > --
> > > > Ketan
> > > >
> > > >
> > > >
> > >
> > >
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



