provider staging to remote sites (was [Swift-user] Re: 3rd party transfers)

Allan Espinosa aespinosa at cs.uchicago.edu
Fri Dec 10 11:29:32 CST 2010


Hi Mike,

I temporarily converted my absolute path references to relative ones
with symlinks, but jobs started to fail at 800 files:

2010-12-10 11:26:31,116-0600 INFO  vdl:execute Exception in cat:
Arguments: [RuptureVariations/100/3/100_3.txt.variation-s0002-h0003]
Host: Firefly_ff-grid.unl.edu
Directory: catsall-20101210-1126-g172ithb/jobs/k/cat-kxrvat2kTODO: outs
----

Caused by: Task failed: null
java.lang.IllegalStateException: Timer already cancelled.
        at java.util.Timer.sched(Timer.java:354)
        at java.util.Timer.schedule(Timer.java:170)
        at org.globus.cog.karajan.workflow.service.commands.Command.setupReplyTimeoutChecker(Command.java:156)
        at org.globus.cog.karajan.workflow.service.commands.Command.dataSent(Command.java:150)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:253)


2010-12-10 11:26:31,116-0600 DEBUG ConfigProperty Getting property
pgraph with host null
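
For reference, the symlink conversion was roughly the following shape (a
sketch only; the mktemp directories stand in for the real /gpfs/pads data
and run locations):

```shell
#!/bin/bash
# Sketch of the workaround: expose the absolute data directory under a
# relative name inside the run directory, so mappers can use relative
# paths. The temp dirs stand in for the real GPFS and run locations.
set -e
datadir=$(mktemp -d)   # stands in for the absolute /gpfs/pads/... prefix
echo sample > "$datadir/100_0.txt.variation-s0000-h0000"
rundir=$(mktemp -d)    # stands in for the Swift run directory
cd "$rundir"
ln -s "$datadir" RuptureVariations   # relative alias for the absolute path
cat RuptureVariations/100_0.txt.variation-s0000-h0000
```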


2010/12/10 Michael Wilde <wilde at mcs.anl.gov>:
> Hi Allan,
>
> I vaguely recall similar issues with prior tests of provider staging. Based on Mihael's recommendation I've been using the "proxy" mode. I don't have my head around all the modes at the moment (I did when I first looked at it).
>
> At any rate, in my tests using proxy mode, just on localhost, I did not run into any full-pathname problems: I used simple_mapper and unqualified partial pathnames.
>
> My test is on the CI net at: /home/wilde/swift/lab/tests/test.local.ps.sh
> and pasted below.  We should build a similar test to validate proxy-mode provider staging on remote sites with coasters.  Whoever gets to it first.
>
> See if using the pattern below gets you past this full-pathname problem.
>
> - Mike
>
> bri$ cat ./test.local.ps.sh
> #! /bin/bash
>
> cat >tc <<END
>
> localhost sh /bin/sh null null null
> localhost cat /bin/cat null null null
>
> END
>
> cat >sites.xml <<END
>
> <config>
>  <pool handle="localhost">
>    <execution provider="coaster" url="none" jobmanager="local:local"/>
>    <profile namespace="globus" key="workersPerNode">8</profile>
>    <profile namespace="globus" key="slots">1</profile>
>    <profile namespace="globus" key="maxnodes">1</profile>
>    <profile key="jobThrottle" namespace="karajan">.15</profile>
>    <profile namespace="karajan" key="initialScore">10000</profile>
>    <profile namespace="swift" key="stagingMethod">proxy</profile>
>    <workdirectory>$PWD</workdirectory>
>  </pool>
> </config>
>
> END
>
> cat >cf <<END
>
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=0
> lazy.errors=false
> status.mode=provider
> use.provider.staging=true
> provider.staging.pin.swiftfiles=false
>
> END
>
> cat >pstest.swift <<EOF
>
> type file;
>
> app (file o) cat (file i)
> {
>  cat @i stdout=@o;
> }
>
> file infile[]  <simple_mapper; location="indir", prefix="f.", suffix=".in">;
> file outfile[] <simple_mapper; location="outdir", prefix="f.",suffix=".out">;
>
> foreach f, i in infile {
>  outfile[i] = cat(f);
> }
>
> EOF
>
> swift -config cf -tc.file tc -sites.file sites.xml pstest.swift
> bri$
>
>
> bri$ mkdir outdir
> bri$ ls
> indir/  outdir/  test.local.ps.sh
> bri$ ls indir
> f.0000.in  f.0001.in  f.0002.in  f.0003.in  f.0004.in
> bri$ ls outdir
> bri$ ./test.local.ps.sh
> Swift svn swift-r3758 cog-r2951 (cog modified locally)
>
> RunID: 20101210-1108-qsdi3mz6
> Progress:
> Progress:  Active:4  Finished successfully:1
> Final status:  Finished successfully:5
> bri$ ls outdir
> f.0000.out  f.0001.out  f.0002.out  f.0003.out  f.0004.out
> bri$
>
>
> ----- Original Message -----
>> Hi Mike,
>>
>> I'm having problems getting provider staging to work. It seems to pass
>> files as absolute references:
>>
>> _____________________________________________________________________________
>>
>> command line
>> _____________________________________________________________________________
>>
>> -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if
>> //gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/100/0/100_0.txt.variation-s0000-h0000
>> -of -k -cdmfile -status provider -a
>> /gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/100/0/100_0.txt.variation-s0000-h0000
>>
>> _____________________________________________________________________________
>>
>> stdout
>> _____________________________________________________________________________
>>
>>
>> _____________________________________________________________________________
>>
>> stderr
>> _____________________________________________________________________________
>>
>> /bin/cat:
>> /gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/100/0/100_0.txt.variation-s0000-h0000:
>> No such file or directory
>>
>> But the remote site does not have /gpfs/pads .
>>
>> Should I be modifying my mappers to accommodate this?
>>
>> -Allan
>>
>>
>> 2010/12/10 Michael Wilde <wilde at mcs.anl.gov>:
>> > Did you try provider staging, which might be easier to throttle
>> > given that the staging endpoints are more under Swift's control?
>> >
>> > - MIke
>> >
>> > ----- Original Message -----
>> >> Hi Mike.
>> >>
>> >> Yes. I had the workflow stage in 1, 10, 40, 80, 400, 800, 2000,
>> >> 8000, and 30000 files. The throttles are the same for each run.
>> >> Problems started to occur at around 800 files.
>> >>
>> >> For staging in local files, problems started to occur at 30000
>> >> files, where vdl:dostagein hits gpfs too hard.
>> >>
>> >> -Allan
>> >>
>> >>
>> >> 2010/12/10 Michael Wilde <wilde at mcs.anl.gov>:
>> >> > Allan, did you verify that each remote site you are talking to in
>> >> > this test is functional at low transaction rates using your
>> >> > current
>> >> > sites configuration?
>> >> >
>> >> > I.e., are you certain that the error below is due to load and not
>> >> > a
>> >> > site-related error?
>> >> >
>> >> > - Mike
>> >> >
>> >> >
>> >> > ----- Original Message -----
>> >> >> I tried to make the tests more synthetic using Mike's catsall
>> >> >> workflow, staging ~3 MB data files in to 5 OSG sites. Swift seems
>> >> >> to handle the transfers well when the originating files are local,
>> >> >> but when it starts to use remote file objects I get all these 3rd
>> >> >> party transfer exceptions. My throttle for file transfers is 8 and
>> >> >> for file operations is 10.
>> >> >>
>> >> >> 2010-12-09 18:58:16,700-0600 DEBUG DelegatedFileTransferHandler
>> >> >> File
>> >> >> transfer with resource remote->tmp
>> >> >> 2010-12-09 18:58:16,734-0600 DEBUG DelegatedFileTransferHandler
>> >> >> Exception in transfer
>> >> >> org.globus.cog.abstraction.impl.file.IrrecoverableResourceException:
>> >> >> Exception in getFile
>> >> >> at
>> >> >> org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTP
>> >> >> FileResource.java:62)
>> >> >> at
>> >> >> org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getFile(FileResourceImpl.java
>> >> >> :401)
>> >> >> at
>> >> >> org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doSource(DelegatedFil
>> >> >> eTransferHandler.java:269)
>> >> >> at
>> >> >> org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doSource(Cachi
>> >> >> ngDelegatedFileTransferHandler.java:59)
>> >> >> at
>> >> >> org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTran
>> >> >> sferHandler.java:486)
>> >> >> at java.lang.Thread.run(Thread.java:619)
>> >> >> Caused by:
>> >> >> org.globus.cog.abstraction.impl.file.FileResourceException:
>> >> >> Failed to retrieve file information
>> >> >> about
>> >> >> /projsmall/osg/data/engage/scec/swift_scratch/catsall-20101209-1839-pnazhid6/info/p/cat-p2em5s2k-in
>> >> >> fo
>> >> >> at
>> >> >> org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(AbstractFTP
>> >> >> FileResource.java:51)
>> >> >> at
>> >> >> org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getGridFile(FileResourceImpl.
>> >> >> java:550)
>> >> >> at
>> >> >> org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getFile(FileResourceImpl.java
>> >> >> :384)
>> >> >> ... 4 more
>> >> >> Caused by: org.globus.ftp.exception.ServerException: Server
>> >> >> refused
>> >> >> performing the request. Custom message
>> >> >> : Server refused MLST command (error code 1) [Nested exception
>> >> >> message: Custom message: Unexpected reply:
>> >> >> 500-Command failed :
>> >> >> globus_gridftp_server_file.c:globus_l_gfs_file_stat:389:
>> >> >> 500-System error in stat: No such file or directory
>> >> >> 500-A system call failed: No such file or directory
>> >> >> 500 End.] [Nested exception is
>> >> >> org.globus.ftp.exception.UnexpectedReplyCodeException: Custom
>> >> >> message: Une
>> >> >> xpected reply: 500-Command failed :
>> >> >> globus_gridftp_server_file.c:globus_l_gfs_file_stat:389:
>> >> >> 500-System error in stat: No such file or directory
>> >> >> 500-A system call failed: No such file or directory
>> >> >> 500 End.]
>> >> >> at
>> >> >> org.globus.ftp.exception.ServerException.embedUnexpectedReplyCodeException(ServerException.java
>> >> >> :101)
>> >> >> at org.globus.ftp.FTPClient.mlst(FTPClient.java:643)
>> >> >> at
>> >> >> org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getGridFile(FileResourceImpl.
>> >> >> java:546)
>> >> >> ... 5 more
>> >> >>
>> >> >>
>> >> >> I may have been stressing the source gridftp server (pads) too
>> >> >> much for it to handle a throttle of 8. But at this configuration
>> >> >> I get low transfer performance. When doing direct transfers, I was
>> >> >> able to get better transfer rates until I started choking gpfs at
>> >> >> 10k stageins. My throttle for this configuration was 40 for both
>> >> >> file transfers and file operations.
>> >> >>
>> >> >>
>> >> >> 2010/12/2 Allan Espinosa <aespinosa at cs.uchicago.edu>:
>> >> >> > I have a bunch of 3rd party gridftp transfers. Swift reports
>> >> >> > around 10k jobs being in vdl:stagein at a time. After a while I
>> >> >> > get a couple of these errors. Does it look like I'm stressing
>> >> >> > the gridftp servers? My throttle.transfers=8
>> >> >> >
>> >> >> > 2010-12-02 02:22:06,008-0600 DEBUG
>> >> >> > DelegatedFileTransferHandler
>> >> >> > Starting service on gsiftp://gpn-hus
>> >> >> > 2010-12-02 02:22:06,008-0600 DEBUG
>> >> >> > DelegatedFileTransferHandler
>> >> >> > File
>> >> >> > transfer with resource local->r
>> >> >> > 2010-12-02 02:22:06,247-0600 DEBUG
>> >> >> > DelegatedFileTransferHandler
>> >> >> > Exception in transfer
>> >> >> > org.globus.cog.abstraction.impl.file.FileResourceException
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(Abstr
>> >> >> > esource.java:51)
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.file.ftp.AbstractFTPFileResource.translateException(Abstr
>> >> >> > esource.java:34)
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImp
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(D
>> >> >> > eTransferHandler.java:352)
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestin
>> >> >> > ngDelegatedFileTransferHandler.java:46)
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFi
>> >> >> > andler.java:489)
>> >> >> >        at java.lang.Thread.run(Thread.java:619)
>> >> >> > Caused by: org.globus.ftp.exception.ServerException: Server
>> >> >> > refused
>> >> >> > performing the request. Custom m
>> >> >> > rror code 1) [Nested exception message: Custom message:
>> >> >> > Unexpected
>> >> >> > reply: 451 ocurred during retrie
>> >> >> > org.globus.ftp.exception.DataChannelException: setPassive()
>> >> >> > must
>> >> >> > match
>> >> >> > store() and setActive() - ret
>> >> >> > rror code 2)
>> >> >> > org.globus.ftp.exception.DataChannelException: setPassive()
>> >> >> > must
>> >> >> > match
>> >> >> > store() and setActive() - ret
>> >> >> > rror code 2)
>> >> >> >        at
>> >> >> >        org.globus.ftp.extended.GridFTPServerFacade.retrieve(GridFTPServerFacade.java:469)
>> >> >> >        at org.globus.ftp.FTPClient.put(FTPClient.java:1294)
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImp
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(D
>> >> >> > eTransferHandler.java:352)
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestin
>> >> >> > ngDelegatedFileTransferHandler.java:46)
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFi
>> >> >> > andler.java:489)
>> >> >> >        at java.lang.Thread.run(Thread.java:619)
>> >> >> > ] [Nested exception is
>> >> >> > org.globus.ftp.exception.UnexpectedReplyCodeException: Custom
>> >> >> > message: Unexp
>> >> >> > : 451 ocurred during retrieve()
>> >> >> > org.globus.ftp.exception.DataChannelException: setPassive()
>> >> >> > must
>> >> >> > match
>> >> >> > store() and setActive() - ret
>> >> >> > rror code 2)
>> >> >> > org.globus.ftp.exception.DataChannelException: setPassive()
>> >> >> > must
>> >> >> > match
>> >> >> > store() and setActive() - ret
>> >> >> > rror code 2)
>> >> >> >        at
>> >> >> >        org.globus.ftp.extended.GridFTPServerFacade.retrieve(GridFTPServerFacade.java:469)
>> >> >> >        at org.globus.ftp.FTPClient.put(FTPClient.java:1294)
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImp
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(D
>> >> >> > eTransferHandler.java:352)
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestin
>> >> >> > ngDelegatedFileTransferHandler.java:46)
>> >> >> >        at
>> >> >> >        org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFi
>> >> >> > andler.java:489)
>> >> >> >        at java.lang.Thread.run(Thread.java:619)
>> >> >> > ]
>> >> >> >        at
>> >> >> >        org.globus.ftp.exception.ServerException.embedUnexpectedReplyCodeException(ServerExceptio
>> >> >> >        at
>> >> >> >        org.globus.ftp.exception.ServerException.embedUnexpectedReplyCodeException(ServerExceptio
>> >> >> >        at
>> >> >> >        org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:195)
>> >> >> >        ... 1 more
>> >> >> >
>> >>
>> >> --
>> >> Allan M. Espinosa <http://amespinosa.wordpress.com>
>> >> PhD student, Computer Science
>> >> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
>> >
>> > --
>> > Michael Wilde
>> > Computation Institute, University of Chicago
>> > Mathematics and Computer Science Division
>> > Argonne National Laboratory
>> >
>> >
>> >
>>
>>
>>
>> --
>> Allan M. Espinosa <http://amespinosa.wordpress.com>
>> PhD student, Computer Science
>> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
>
>



-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


