[Swift-devel] Re: Provider staging error in long-running test

Mihael Hategan hategan at mcs.anl.gov
Sat Nov 27 19:41:46 CST 2010


So I think that was due to an incorrect assumption that message headers
would never be broken up into pieces by the TCP layer. That caused the
worker to fail, presumably under high load, but I cannot be sure about
the exact conditions that led to the problem (and therefore I cannot be
sure that this is the solution).

I have added code to read things from a socket in a more resilient
fashion.
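
For illustration only, here is a minimal Java sketch of the general
technique (this is not the actual change, which is not shown in this
message). TCP delivers a byte stream, so a single read may return only
part of a header; the reader has to loop until it has every byte it
expects. The class name HeaderReader, the 12-byte HEADER_LEN, and the
trickle() helper are assumptions made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class HeaderReader {
    // Hypothetical fixed header size, chosen only for this example.
    public static final int HEADER_LEN = 12;

    /** Reads exactly len bytes, looping over short reads. */
    public static byte[] readFully(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        int off = 0;
        while (off < len) {
            // read() may return fewer bytes than requested.
            int n = in.read(buf, off, len - off);
            if (n < 0) {
                throw new EOFException("stream closed after " + off
                        + " of " + len + " bytes");
            }
            off += n;
        }
        return buf;
    }

    /** Demo helper: a stream that hands out at most chunk bytes per read,
     *  mimicking a header split across TCP segments. */
    public static InputStream trickle(byte[] data, final int chunk) {
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, chunk));
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = new byte[HEADER_LEN + 5];
        for (int i = 0; i < wire.length; i++) {
            wire[i] = (byte) i;
        }
        // The 12-byte header arrives in pieces of at most 5 bytes.
        InputStream in = trickle(wire, 5);
        byte[] header = readFully(in, HEADER_LEN);
        byte[] payload = readFully(in, 5);
        System.out.println(header.length + " header bytes, "
                + payload.length + " payload bytes");
    }
}
```

In Java code, java.io.DataInputStream.readFully does the same job; the
loop is spelled out here only to make the short-read handling visible.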

I have also removed the idle timeout from the worker. That should not
bother us any more.

Mihael

On Mon, 2010-11-22 at 18:29 -0800, Mihael Hategan wrote:
> Ok. So that doesn't look like it's a staging problem specifically, but
> more like something with the comm library. I'll have to look at the
> logs. And I can foresee some free time coming in a couple of days just
> for that!
> 
> Mihael
> 
> On Sun, 2010-11-21 at 23:10 -0600, Michael Wilde wrote:
> > Mihael, here is bug 3:
> > 
> > I was testing a foreach loop doing a cat of 10,000 input files of sizes up to about 300-400K each.  The test hit an error after around 3,491 files:
> > 
> > Progress:  Selecting site:1008  Submitted:12  Active:3  Finished successfully:3476
> > Progress:  Selecting site:1008  Submitted:13  Active:3  Finished successfully:3491
> > Failed to shut down channel
> > java.lang.NullPointerException
> >         at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.configureHeartBeat(AbstractKarajanChannel.java:57)
> >         at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.<init>(AbstractKarajanChannel.java:52)
> > 
> > The test was executed on PADS login1 like this:
> > 
> > cd /home/wilde/swift/lab
> > ./run.local.coast.ps.sh catsall
> > 
> > log file: catsall-20101121-2239-oc2flmn0.log
> > 
> > sites.xml:
> > 
> > <config>
> >   <pool handle="localhost">
> >     <!-- <execution provider="coaster-persistent" url="http://login1.pads.ci.uchicago.edu:" jobmanager="local:local"/> -->
> >     <execution provider="coaster" url="none" jobmanager="local:local"/>
> >     <!-- <profile namespace="globus" key="workerManager">passive</profile> -->
> >     <profile namespace="globus" key="workersPerNode">8</profile>
> >     <profile namespace="globus" key="slots">1</profile>
> >     <profile namespace="globus" key="maxnodes">1</profile>
> >     <profile key="jobThrottle" namespace="karajan">.15</profile>
> >     <profile namespace="karajan" key="initialScore">10000</profile>
> >     <profile namespace="swift" key="stagingMethod">proxy</profile>
> >     <workdirectory>/scratch/local/wilde/pstest/swiftwork</workdirectory>
> >   </pool>
> > </config>
> > 
> > login1$ cat cf
> > wrapperlog.always.transfer=true
> > sitedir.keep=true
> > execution.retries=0
> > lazy.errors=false
> > status.mode=provider
> > use.provider.staging=true
> > provider.staging.pin.swiftfiles=false
> > login1$ cat catsall.swift
> > type file;
> > 
> > app (file o) cat (file i)
> > {
> >   cat @i stdout=@o;
> > }
> > 
> > file infile[]  <simple_mapper; location="indir", prefix="f.", suffix=".in">;
> > file outfile[] <simple_mapper; location="outdir", prefix="f.",suffix=".out">;
> > 
> > foreach f, i in infile {
> >   outfile[i] = cat(f);
> > }
> > login1$ 
> > 
> > login1$ which swift
> > /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift
> > login1$ java -version
> > java version "1.6.0_22"
> > Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
> > Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
> > login1$ 
> > 
> > 
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel




