provider staging to remote sites (was [Swift-user] Re: 3rd party transfers)

Michael Wilde wilde at mcs.anl.gov
Fri Dec 10 12:06:42 CST 2010


You should post logs and details of the failure to swift-devel for Mihael to diagnose.

In the meantime, you should test between two more local machines, e.g., from bridled to communicado, and then maybe move on to more distant machines.

You should check to make sure that you have not run /tmp out of space on the destination site.  Perhaps you clobbered the destination node by driving its root filesystem out of space?
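
A rough sketch of the kind of free-space check I mean, to be run on the destination node (for instance as a trivial job). The function names and the 1 GB threshold are made up for illustration; only standard df and awk are used, and nothing here is part of Swift itself.

```shell
#!/bin/bash
# Hypothetical helpers (not part of Swift): report free space on a filesystem.

# Print the available space, in 1K blocks, of the filesystem holding $1.
# df -P forces the POSIX one-line-per-filesystem layout; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Succeed only if the filesystem holding $1 has at least $2 KB free.
has_space() {
    local dir=$1 min_kb=$2
    local avail
    avail=$(free_kb "$dir")
    [ -n "$avail" ] && [ "$avail" -ge "$min_kb" ]
}

# Example: the 1 GB threshold (1048576 KB) is an arbitrary assumption.
if has_space /tmp 1048576; then
    echo "/tmp has enough free space"
else
    echo "WARNING: /tmp is low on space" >&2
fi
```

Running the same check against / would tell you whether the root filesystem itself was filled.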

I don't know how provider staging picks the destination directory: whether it's hardwired to /tmp (which would be bad and would need to be fixed) or whether it honors the <scratch> tag, which would be great; in that case you should set it to $OSG_WN_TMP for OSG sites.

At any rate, try this first in a more controlled environment where you can closely observe both the client and the server, and stress test that scenario first, much like I did in the localhost scenario.  Then gradually move to OSG once you know how the provider staging mechanism behaves.
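
To scale such a stress test up gradually, something along these lines could generate the input files. This is only a sketch: make_inputs is a hypothetical helper, and the f.NNNN.in naming under indir/ is assumed from the simple_mapper settings (prefix "f.", suffix ".in") in the localhost test quoted below.

```shell
#!/bin/bash
# Hypothetical helper: create N small input files named f.NNNN.in in a directory
# (default: indir), matching the simple_mapper pattern used by the localhost test.
make_inputs() {
    local n=$1 dir=${2:-indir}
    local i
    mkdir -p "$dir"
    for ((i = 0; i < n; i++)); do
        # Zero-pad the index to four digits, e.g. indir/f.0003.in
        printf 'input %d\n' "$i" > "$(printf '%s/f.%04d.in' "$dir" "$i")"
    done
}

# Start small; rerun with larger counts (e.g. 10, 40, 80, 400, 800) once this works.
make_inputs 5
ls indir
```

After each step you would rerun the test script against the new indir and note the file count at which failures first appear.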

- Mike


----- Original Message -----
> Hi Mike,
> 
> I temporarily converted my absolute path references to relative ones
> with symlinks. But jobs started to fail at 800 files:
> 
> 2010-12-10 11:26:31,116-0600 INFO vdl:execute Exception in cat:
> Arguments: [RuptureVariations/100/3/100_3.txt.variation-s0002-h0003]
> Host: Firefly_ff-grid.unl.edu
> Directory: catsall-20101210-1126-g172ithb/jobs/k/cat-kxrvat2k
> TODO: outs
> ----
> 
> Caused by: Task failed: null
> java.lang.IllegalStateException: Timer already cancelled.
> at java.util.Timer.sched(Timer.java:354)
> at java.util.Timer.schedule(Timer.java:170)
> at
> org.globus.cog.karajan.workflow.service.commands.Command.setupReplyTimeoutChecker(Command.java:156)
> at
> org.globus.cog.karajan.workflow.service.commands.Command.dataSent(Command.java:150)
> at
> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:253)
> 
> 
> 2010-12-10 11:26:31,116-0600 DEBUG ConfigProperty Getting property
> pgraph with host null
> 
> 
> 2010/12/10 Michael Wilde <wilde at mcs.anl.gov>:
> > Hi Allan,
> >
> > I vaguely recall similar issues with prior tests of provider
> > staging. Based on Mihael's recommendation I've been using the "proxy"
> > mode. I don't have my head around all the modes at the moment (I did
> > when I first looked at it).
> >
> > At any rate, in my tests using proxy mode, just on localhost, I did
> > not run into any full-pathname problems: I used simple_mapper and
> > unqualified partial pathnames.
> >
> > My test is on the CI net at:
> > /home/wilde/swift/lab/tests/test.local.ps.sh
> > and pasted below. We should build a similar test to validate
> > proxy-mode provider staging on remote sites with coasters. Whoever
> > gets to it first.
> >
> > See if using the pattern below gets you past this full-pathname
> > problem.
> >
> > - Mike
> >
> > bri$ cat ./test.local.ps.sh
> > #! /bin/bash
> >
> > cat >tc <<END
> >
> > localhost sh /bin/sh null null null
> > localhost cat /bin/cat null null null
> >
> > END
> >
> > cat >sites.xml <<END
> >
> > <config>
> >  <pool handle="localhost">
> >    <execution provider="coaster" url="none"
> >    jobmanager="local:local"/>
> >    <profile namespace="globus" key="workersPerNode">8</profile>
> >    <profile namespace="globus" key="slots">1</profile>
> >    <profile namespace="globus" key="maxnodes">1</profile>
> >    <profile key="jobThrottle" namespace="karajan">.15</profile>
> >    <profile namespace="karajan" key="initialScore">10000</profile>
> >    <profile namespace="swift" key="stagingMethod">proxy</profile>
> >    <workdirectory>$PWD</workdirectory>
> >  </pool>
> > </config>
> >
> > END
> >
> > cat >cf <<END
> >
> > wrapperlog.always.transfer=true
> > sitedir.keep=true
> > execution.retries=0
> > lazy.errors=false
> > status.mode=provider
> > use.provider.staging=true
> > provider.staging.pin.swiftfiles=false
> >
> > END
> >
> > cat >pstest.swift <<EOF
> >
> > type file;
> >
> > app (file o) cat (file i)
> > {
> >  cat @i stdout=@o;
> > }
> >
> > file infile[] <simple_mapper; location="indir", prefix="f.",
> > suffix=".in">;
> > file outfile[] <simple_mapper; location="outdir",
> > prefix="f.",suffix=".out">;
> >
> > foreach f, i in infile {
> >  outfile[i] = cat(f);
> > }
> >
> > EOF
> >
> > swift -config cf -tc.file tc -sites.file sites.xml pstest.swift
> > bri$
> >
> >
> > bri$ mkdir outdir
> > bri$ ls
> > indir/ outdir/ test.local.ps.sh
> > bri$ ls indir
> > f.0000.in f.0001.in f.0002.in f.0003.in f.0004.in
> > bri$ ls outdir
> > bri$ ./test.local.ps.sh
> > Swift svn swift-r3758 cog-r2951 (cog modified locally)
> >
> > RunID: 20101210-1108-qsdi3mz6
> > Progress:
> > Progress: Active:4 Finished successfully:1
> > Final status: Finished successfully:5
> > bri$ ls outdir
> > f.0000.out f.0001.out f.0002.out f.0003.out f.0004.out
> > bri$
> >
> >
> > ----- Original Message -----
> >> Hi Mike,
> >>
> >> I'm having problems getting provider staging to work. It seems to
> >> pass files as absolute references:
> >>
> >> _____________________________________________________________________________
> >>
> >> command line
> >> _____________________________________________________________________________
> >>
> >> -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if
> >> //gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/100/0/100_0.txt.variation-s0000-h0000
> >> -of -k -cdmfile -status provider -a
> >> /gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/100/0/100_0.txt.variation-s0000-h0000
> >>
> >> _____________________________________________________________________________
> >>
> >> stdout
> >> _____________________________________________________________________________
> >>
> >>
> >> _____________________________________________________________________________
> >>
> >> stderr
> >> _____________________________________________________________________________
> >>
> >> /bin/cat:
> >> /gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/100/0/100_0.txt.variation-s0000-h0000:
> >> No such file or directory
> >>
> >> But the remote site does not have /gpfs/pads .
> >>
> >> Should I be modifying my mappers to accommodate this?
> >>
> >> -Allan
> >>
> >>
> >> 2010/12/10 Michael Wilde <wilde at mcs.anl.gov>:
> >> > Did you try provider staging, which might be easier to throttle
> >> > given that the staging endpoints are more under Swift's control?
> >> >
> >> > - Mike
> >> >
> >> > ----- Original Message -----
> >> >> Hi Mike.
> >> >>
> >> >> Yes. I had the workflow stage in 1, 10, 40, 80, 400, 800, 2000, 8000,
> >> >> and 30000 files. The throttles are the same for each run. Problems
> >> >> started to occur at around 800 files.
> >> >>
> >> >> For staging in local files, problems started to occur at 30000 files,
> >> >> where vdl:dostagein hits gpfs too much.
> >> >>
> >> >> -Allan
> >> >>
> >> >>
> >> >> 2010/12/10 Michael Wilde <wilde at mcs.anl.gov>:
> >> >> > Allan, did you verify that each remote site you are talking to
> >> >> > in
> >> >> > this test is functional at low transaction rates using your
> >> >> > current
> >> >> > sites configuration?
> >> >> >
> >> >> > I.e., are you certain that the error below is due to load and
> >> >> > not
> >> >> > a
> >> >> > site-related error?
> >> >> >
> >> >> > - Mike
> >> >> >
> >> >> >
> >> >> > ----- Original Message -----
> >> >> >> I tried to make the tests more synthetic using Mike's catsall
> >> >> >> workflow, staging in ~3 MB data files to 5 OSG sites. Swift seems
> >> >> >> to handle the transfer well when the originating files are local.
> >> >> >> But when it starts to use remote file objects, I get all these
> >> >> >> 3rd party transfer exceptions. My throttle for file transfers is 8
> >> >> >> and for file operations is 10.
> >> >> >>
> >> >> >> 2010-12-09 18:58:16,700-0600 DEBUG
> >> >> >> DelegatedFileTransferHandler
> >> >> >> File
> >> >> >> transfer with resource remote->tmp
> >> >> >> 2010-12-09 18:58:16,734-0600 DEBUG
> >> >> >> DelegatedFileTransferHandler
> >> >> >> Exception in transfer
> >> >> >> org.globus.cog.abstraction.impl.file.IrrecoverableResourceException:
> >> >> >> Exception in getFile
> >> >> >> at
> >> >> >> org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTP
> >> >> >> FileResource.java:62)
> >> >> >> at
> >> >> >> org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getFile(FileResourceImpl.java
> >> >> >> :401)
> >> >> >> at
> >> >> >> org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doSource(DelegatedFil
> >> >> >> eTransferHandler.java:269)
> >> >> >> at
> >> >> >> org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doSource(Cachi
> >> >> >> ngDelegatedFileTransferHandler.java:59)
> >> >> >> at
> >> >> >> org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTran
> >> >> >> sferHandler.java:486)
> >> >> >> at java.lang.Thread.run(Thread.java:619)
> >> >> >> Caused by:
> >> >> >> org.globus.cog.abstraction.impl.file.FileResourceException:
> >> >> >> Failed to retrieve file information
> >> >> >> about
> >> >> >> /projsmall/osg/data/engage/scec/swift_scratch/catsall-20101209-1839-pnazhid6/info/p/cat-p2em5s2k-in
> >> >> >> fo
> >> >> >> at
> >> >> >> org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTP
> >> >> >> FileResource.java:51)
> >> >> >> at
> >> >> >> org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getGridFile(FileResourceImpl.
> >> >> >> java:550)
> >> >> >> at
> >> >> >> org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getFile(FileResourceImpl.java
> >> >> >> :384)
> >> >> >> ... 4 more
> >> >> >> Caused by: org.globus.ftp.exception.ServerException: Server
> >> >> >> refused
> >> >> >> performing the request. Custom message
> >> >> >> : Server refused MLST command (error code 1) [Nested
> >> >> >> exception
> >> >> >> message: Custom message: Unexpected reply:
> >> >> >> 500-Command failed :
> >> >> >> globus_gridftp_server_file.c:globus_l_gfs_file_stat:389:
> >> >> >> 500-System error in stat: No such file or directory
> >> >> >> 500-A system call failed: No such file or directory
> >> >> >> 500 End.] [Nested exception is
> >> >> >> org.globus.ftp.exception.UnexpectedReplyCodeException: Custom
> >> >> >> message: Une
> >> >> >> xpected reply: 500-Command failed :
> >> >> >> globus_gridftp_server_file.c:globus_l_gfs_file_stat:389:
> >> >> >> 500-System error in stat: No such file or directory
> >> >> >> 500-A system call failed: No such file or directory
> >> >> >> 500 End.]
> >> >> >> at
> >> >> >> org.globus.ftp.exception.ServerException.embedUnexpectedReplyCodeException(ServerException.java
> >> >> >> :101)
> >> >> >> at org.globus.ftp.FTPClient.mlst(FTPClient.java:643)
> >> >> >> at
> >> >> >> org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getGridFile(FileResourceImpl.
> >> >> >> java:546)
> >> >> >> ... 5 more
> >> >> >>
> >> >> >>
> >> >> >> I may have been stressing the source gridftp server (pads) so much
> >> >> >> that it cannot handle a throttle of 8. But at this configuration,
> >> >> >> I get low transfer performance. When doing direct transfers, I was
> >> >> >> able to get better transfer rates until I started choking gpfs at
> >> >> >> 10k stageins. My throttle for this configuration was 40 for both
> >> >> >> file transfers and file operations.
> >> >> >>
> >> >> >>
> >> >> >> 2010/12/2 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> >> >> >> > I have a bunch of 3rd party gridftp transfers. Swift reports
> >> >> >> > around 10k jobs being in vdl:stagein at a time. After a while I
> >> >> >> > get a couple of these errors. Does it look like I'm stressing the
> >> >> >> > gridftp servers? My setting is throttle.transfers=8.
> >> >> >> >
> >> >> >> > 2010-12-02 02:22:06,008-0600 DEBUG
> >> >> >> > DelegatedFileTransferHandler
> >> >> >> > Starting service on gsiftp://gpn-hus
> >> >> >> > 2010-12-02 02:22:06,008-0600 DEBUG
> >> >> >> > DelegatedFileTransferHandler
> >> >> >> > File
> >> >> >> > transfer with resource local->r
> >> >> >> > 2010-12-02 02:22:06,247-0600 DEBUG
> >> >> >> > DelegatedFileTransferHandler
> >> >> >> > Exception in transfer
> >> >> >> > org.globus.cog.abstraction.impl.file.FileResourceException
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(Abstr
> >> >> >> > esource.java:51)
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(Abstr
> >> >> >> > esource.java:34)
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImp
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(D
> >> >> >> > eTransferHandler.java:352)
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestin
> >> >> >> > ngDelegatedFileTransferHandler.java:46)
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFi
> >> >> >> > andler.java:489)
> >> >> >> >        at java.lang.Thread.run(Thread.java:619)
> >> >> >> > Caused by: org.globus.ftp.exception.ServerException: Server
> >> >> >> > refused
> >> >> >> > performing the request. Custom m
> >> >> >> > rror code 1) [Nested exception message: Custom message:
> >> >> >> > Unexpected
> >> >> >> > reply: 451 ocurred during retrie
> >> >> >> > org.globus.ftp.exception.DataChannelException: setPassive()
> >> >> >> > must
> >> >> >> > match
> >> >> >> > store() and setActive() - ret
> >> >> >> > rror code 2)
> >> >> >> > org.globus.ftp.exception.DataChannelException: setPassive()
> >> >> >> > must
> >> >> >> > match
> >> >> >> > store() and setActive() - ret
> >> >> >> > rror code 2)
> >> >> >> >        at
> >> >> >> >        org.globus.ftp.extended.GridFTPServerFacade.retrieve(GridFTPServerFacade.java:469)
> >> >> >> >        at org.globus.ftp.FTPClient.put(FTPClient.java:1294)
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImp
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(D
> >> >> >> > eTransferHandler.java:352)
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestin
> >> >> >> > ngDelegatedFileTransferHandler.java:46)
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFi
> >> >> >> > andler.java:489)
> >> >> >> >        at java.lang.Thread.run(Thread.java:619)
> >> >> >> > ] [Nested exception is
> >> >> >> > org.globus.ftp.exception.UnexpectedReplyCodeException:
> >> >> >> > Custom
> >> >> >> > message: Unexp
> >> >> >> > : 451 ocurred during retrieve()
> >> >> >> > org.globus.ftp.exception.DataChannelException: setPassive()
> >> >> >> > must
> >> >> >> > match
> >> >> >> > store() and setActive() - ret
> >> >> >> > rror code 2)
> >> >> >> > org.globus.ftp.exception.DataChannelException: setPassive()
> >> >> >> > must
> >> >> >> > match
> >> >> >> > store() and setActive() - ret
> >> >> >> > rror code 2)
> >> >> >> >        at
> >> >> >> >        org.globus.ftp.extended.GridFTPServerFacade.retrieve(GridFTPServerFacade.java:469)
> >> >> >> >        at org.globus.ftp.FTPClient.put(FTPClient.java:1294)
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImp
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(D
> >> >> >> > eTransferHandler.java:352)
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestin
> >> >> >> > ngDelegatedFileTransferHandler.java:46)
> >> >> >> >        at
> >> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFi
> >> >> >> > andler.java:489)
> >> >> >> >        at java.lang.Thread.run(Thread.java:619)
> >> >> >> > ]
> >> >> >> >        at
> >> >> >> >        org.globus.ftp.exception.ServerException.embedUnexpectedReplyCodeException(ServerExceptio
> >> >> >> >        at
> >> >> >> >        org.globus.ftp.exception.ServerException.embedUnexpectedReplyCodeException(ServerExceptio
> >> >> >> >        at
> >> >> >> >        org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:195)
> >> >> >> >        ... 1 more
> >> >> >> >
> >> >>
> >> >> --
> >> >> Allan M. Espinosa <http://amespinosa.wordpress.com>
> >> >> PhD student, Computer Science
> >> >> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> >> >
> >> > --
> >> > Michael Wilde
> >> > Computation Institute, University of Chicago
> >> > Mathematics and Computer Science Division
> >> > Argonne National Laboratory
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Allan M. Espinosa <http://amespinosa.wordpress.com>
> >> PhD student, Computer Science
> >> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> >
> >
> 
> 
> 
> --
> Allan M. Espinosa <http://amespinosa.wordpress.com>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



