[Swift-devel] Re: Provider staging error in long-running test

Michael Wilde wilde at mcs.anl.gov
Sun Nov 28 00:46:26 CST 2010


Great! The test that failed after 3000+ transfers now ran all 10,000 OK.
I'm putting that in a loop now to see if it runs all night. Looks promising!

- Mike


----- Original Message -----
> So I think that was due to an incorrect assumption that message headers
> will never be broken up into pieces by the TCP layer. That caused the
> worker to fail, presumably under high load, but I cannot be sure about
> the exact conditions that led to the problem (and therefore I cannot
> be sure of the solution).
> 
> I have added code to read things from a socket in a more resilient
> fashion.
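
The failure mode described above (a header arriving in more than one piece) can be sketched as follows. This is only an illustration in Java, chosen to match the stack trace in this thread; it is not the actual worker code or the actual fix. HEADER_LEN, HeaderReader, readFully and trickle are hypothetical names, and the real header layout is not given in this thread. The point is that a single read() on a TCP stream may return fewer bytes than requested, so the reader has to loop until the whole header has arrived.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class HeaderReader {
    // Hypothetical header size; the real protocol's header layout is not stated in this thread.
    static final int HEADER_LEN = 12;

    // Reads exactly len bytes. A single read() may return fewer bytes than
    // requested, so keep reading until the buffer is full or the stream ends.
    static byte[] readFully(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        int off = 0;
        while (off < len) {
            int n = in.read(buf, off, len - off);
            if (n < 0) {
                throw new EOFException("stream ended after " + off + " of " + len + " bytes");
            }
            off += n;
        }
        return buf;
    }

    // Demo helper: a stream that returns at most chunk bytes per read,
    // imitating a header that TCP delivers in several pieces.
    static InputStream trickle(byte[] data, final int chunk) {
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, chunk));
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] msg = new byte[HEADER_LEN];
        for (int i = 0; i < msg.length; i++) {
            msg[i] = (byte) i;
        }
        // The header arrives in pieces of 5, 5 and 2 bytes.
        byte[] header = readFully(trickle(msg, 5), HEADER_LEN);
        System.out.println("header complete: " + Arrays.equals(header, msg));
    }
}
```

Code that issues one read() and assumes the full header is in the buffer works as long as the header happens to arrive in one piece, which may be why a failure like this shows up only after thousands of transfers.
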
> 
> I have also removed the idle timeout from the worker. That should not
> bother us any more.
> 
> Mihael
> 
> On Mon, 2010-11-22 at 18:29 -0800, Mihael Hategan wrote:
> > Ok. So that doesn't look like it's a staging problem specifically, but
> > more like something with the comm library. I'll have to look at the
> > logs. And I can foresee some free time coming in a couple of days just
> > for that!
> >
> > Mihael
> >
> > On Sun, 2010-11-21 at 23:10 -0600, Michael Wilde wrote:
> > > Mihael, here is bug 3:
> > >
> > > I was testing a foreach loop doing a cat of 10,000 input files of
> > > sizes up to about 300-400K each. The test hit an error after
> > > around 3,491 files:
> > >
> > > Progress: Selecting site:1008 Submitted:12 Active:3 Finished successfully:3476
> > > Progress: Selecting site:1008 Submitted:13 Active:3 Finished successfully:3491
> > > Failed to shut down channel
> > > java.lang.NullPointerException
> > >         at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.configureHeartBeat(AbstractKarajanChannel.java:57)
> > >         at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.<init>(AbstractKarajanChannel.java:52)
> > >
> > > The test was executed on PADS login1 like this:
> > >
> > > cd /home/wilde/swift/lab
> > > ./run.local.coast.ps.sh catsall
> > >
> > > log file: catsall-20101121-2239-oc2flmn0.log
> > >
> > > sites.xml:
> > >
> > > <config>
> > >   <pool handle="localhost">
> > >     <!-- <execution provider="coaster-persistent"
> > >     url="http://login1.pads.ci.uchicago.edu:"
> > >     jobmanager="local:local"/> -->
> > >     <execution provider="coaster" url="none"
> > >     jobmanager="local:local"/>
> > >     <!-- <profile namespace="globus"
> > >     key="workerManager">passive</profile> -->
> > >     <profile namespace="globus" key="workersPerNode">8</profile>
> > >     <profile namespace="globus" key="slots">1</profile>
> > >     <profile namespace="globus" key="maxnodes">1</profile>
> > >     <profile key="jobThrottle" namespace="karajan">.15</profile>
> > >     <profile namespace="karajan"
> > >     key="initialScore">10000</profile>
> > >     <profile namespace="swift" key="stagingMethod">proxy</profile>
> > >     <workdirectory>/scratch/local/wilde/pstest/swiftwork</workdirectory>
> > >   </pool>
> > > </config>
> > >
> > > login1$ cat cf
> > > wrapperlog.always.transfer=true
> > > sitedir.keep=true
> > > execution.retries=0
> > > lazy.errors=false
> > > status.mode=provider
> > > use.provider.staging=true
> > > provider.staging.pin.swiftfiles=false
> > > login1$ cat catsall.swift
> > > type file;
> > >
> > > app (file o) cat (file i)
> > > {
> > >   cat @i stdout=@o;
> > > }
> > >
> > > file infile[] <simple_mapper; location="indir", prefix="f.",
> > > suffix=".in">;
> > > file outfile[] <simple_mapper; location="outdir",
> > > prefix="f.",suffix=".out">;
> > >
> > > foreach f, i in infile {
> > >   outfile[i] = cat(f);
> > > }
> > > login1$
> > >
> > > login1$ which swift
> > > /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift
> > > login1$ java -version
> > > java version "1.6.0_22"
> > > Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
> > > Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
> > > login1$
> > >
> > >
> >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



