[Swift-user] Deep recursion on subroutine "main::stageout" at /home/ketan/work/worker.pl line 1349

Mihael Hategan hategan at mcs.anl.gov
Tue May 22 17:42:16 CDT 2012


With provider staging, the directory in which the job runs (i.e. the job
CWD) is set through the submission protocol. In other words,
_swiftwrap.staging gets run there.

_swiftwrap.staging does not change directories.
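
One way to confirm this from inside the wrapper is to print the CWD and
check how the license file looks right before the app starts. This is only
a rough sketch: check_file is a made-up helper name, not something Swift
provides, and it assumes a plain POSIX shell.

```shell
# Hypothetical helper (not part of Swift): report how a file looks
# from the current working directory.
check_file() {
    f=$1
    if [ -L "$f" ]; then
        # Symbolic link (whether or not its target exists)
        echo "$f: symlink"
    elif [ -f "$f" ] && [ -r "$f" ]; then
        echo "$f: readable regular file"
    elif [ -e "$f" ]; then
        echo "$f: exists but not a readable regular file"
    else
        echo "$f: missing"
    fi
}
```

In the wrapper, something like "pwd; check_file MARS-LIC" placed just
before the marsMain line would show up in the job's stdout.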

The environment might be different between a swift run and a manual
login and run. Is there maybe something in the environment that your app
uses to look up the license file?
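
A quick way to compare the two environments, as a sketch only: dump_env
and diff_env are hypothetical helper names, and the file names in the note
below are arbitrary.

```shell
# Hypothetical helper: write the current environment, sorted, to a file.
dump_env() {
    env | sort > "$1"
}

# Hypothetical helper: show lines that differ between two dumps.
# "<" marks lines only in the first file, ">" lines only in the second.
diff_env() {
    # grep exits non-zero when nothing differs; "|| true" hides that.
    diff "$1" "$2" | grep '^[<>]' || true
}
```

Call "dump_env env.swift" from the wrapper during a swift run and
"dump_env env.login" from a manual login in the same directory, then run
"diff_env env.login env.swift" and look for anything license-related.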

Mihael

On Tue, 2012-05-22 at 17:10 -0400, Ketan Maheshwari wrote:
> Mike,
> 
> 
> The jobdir and the workdir are the same right? At least that is what
> the pwd in my marswrapper shows.
> 
> 
> The following is the stdout section of swiftwrap:
> _____________________________________________________________________________
> 
> 
>         stdout
> _____________________________________________________________________________
> 
> 
> # pwd
> /amd/camel/b/ketan/ketan_mars/swift.workdir/mars-20120522-1702-j6gtml62-k-marswrap-kcj9rork
> 
> 
> # cp -v home/ketan/ketan_mars/MARS-LIC .
> `home/ketan/ketan_mars/MARS-LIC' -> `./MARS-LIC'
> 
> 
> # The error message thrown by mars
>  <**> ERROR: *** Unable to open License Date File MARS-LIC ***
> ===================
> 
> 
> This is why I said Mars is running as if the licence file is not
> present even though it is present. 
> 
> 
> Also, I do not see any symlinks here in the workdir. They are all real
> files.
> 
> On Tue, May 22, 2012 at 1:24 PM, Michael Wilde <wilde at mcs.anl.gov>
> wrote:
>         If that path home/ketan/ketan_mars/MARS-LIC is being correctly
>         copied to the workdir (and I stand corrected: thats exactly
>         what should happen) then another possibility is that the
>         program doesnt like getting a symlink for the license file?
>          Can you test that case externally (outside of Swift) before
>         we go further?
>         
>         You reported the problem as "...the executable still gets into
>         error as if the licence file is not present."
>         
>         The license file will appear to the MARS executable (and the
>         wrapper script) as a symlink (from the jobdir to the workdir,
>         to use the terminology of the Swift User Guide).
>         
>         If that is indeed the problem, your wrapper script might be
>         able to get around this with:
>          cp MARS-LIC tmplic
>          rm MARS-LIC
>          mv tmplic MARS-LIC
>         
>         Exactly what error is MARS generating for this problem?
>         
>         - Mike
>         
>         ----- Original Message -----
>         > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
>         > To: "Michael Wilde" <wilde at mcs.anl.gov>
>         > Cc: "Swift User" <swift-user at ci.uchicago.edu>
>         
>         > Sent: Tuesday, May 22, 2012 12:01:49 PM
>         > Subject: Re: [Swift-user] Deep recursion on subroutine
>         "main::stageout" at /home/ketan/work/worker.pl line 1349
>         
>         > The line works fine because Swift creates the dir tree
>         starting at
>         > /home but in the swift.workdir. With -v, I could see the
>         file gets
>         > copied to the cwd and is present there.
>         >
>         >
>         > So, I assume that the wrapper script is not cd'ing me
>         anywhere. So, it
>         > still is a mystery why the app complaint about the file not
>         present
>         > when run from wrapper and it works when run manually in the
>         same dir.
>         >
>         > On Tue, May 22, 2012 at 11:34 AM, Michael Wilde <
>         wilde at mcs.anl.gov >
>         > wrote:
>         >
>         >
>         > Isnt this line problematic if you dont know where the
>         wrapper script
>         > has you cd'ed to:
>         >
>         > cp -v home/ketan/ketan_mars/MARS-LIC .
>         > ^^^
>         >
>         > The relative path doesnt seem safe.
>         >
>         >
>         > - Mike
>         >
>         >
>         > ----- Original Message -----
>         > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
>         > > To: "Michael Wilde" < wilde at mcs.anl.gov >
>         > > Cc: "Swift User" < swift-user at ci.uchicago.edu >
>         >
>         >
>         > > Sent: Tuesday, May 22, 2012 10:18:11 AM
>         > > Subject: Re: [Swift-user] Deep recursion on subroutine
>         
>         > > "main::stageout" at /home/ketan/work/ worker.pl line 1349
>         > > Looking this further, I now have a wrapper in place which
>         copies the
>         > > licence file in the cwd before running the executable.
>         However, the
>         > > executable still gets into error as if the licence file is
>         not
>         > > present.
>         > >
>         > >
>         > > When I cd into this dir
>         (swift.workdir/mars-20120519-1203-3l....)
>         > > and
>         > > manually run the executable, it works.
>         > >
>         > >
>         > > So, the question is does the _swiftwrap.staging does some
>         internal
>         > > cd'ing before calling the executable? I will take a look
>         inside, but
>         > > would be useful if someone knows this.
>         > >
>         > >
>         > > The wrapper script is simply the following two lines:
>         > >
>         > >
>         > > """
>         > > cp -v home/ketan/ketan_mars/MARS-LIC .
>         > > /home/ketan/ketan_mars/marsMain $1
>         > > """
>         > >
>         > >
>         > > Regards,
>         > > Ketan
>         > >
>         > >
>         > > On Mon, May 21, 2012 at 7:51 PM, Michael Wilde <
>         wilde at mcs.anl.gov >
>         > > wrote:
>         > >
>         > >
>         > > Im surprised that Swift isn't setting the current working
>         dir (cwd)
>         > > to
>         > > be the job dir, but perhaps that's controlled by this
>         property:
>         > >
>         > > # Determines if Swift remote wrappers will be executed by
>         specifying
>         > > an
>         > > # absolute path, or a path relative to the job initial
>         working
>         > > directory
>         > > #
>         > > # valid values: absolute, relative
>         > > # wrapper.invocation.mode=absolute
>         > >
>         > > Can you try your script with this property set to
>         "relative"?
>         > >
>         > > ...but looking at this further: I see that if youre using
>         coasters
>         > > with provider staging, the logic for job launch is quite
>         different.
>         > > We
>         > > need to study this and get back to you. For now, best to
>         force the
>         > > right cd's with a wrapper. You might be able to remove the
>         wrapper
>         > > later, once we resolve how the job dir management should
>         work in
>         > > these
>         > > various cases.
>         > >
>         > >
>         > > - Mike
>         > >
>         > >
>         > > ----- Original Message -----
>         > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
>         > >
>         > > > To: "Michael Wilde" < wilde at mcs.anl.gov >
>         > > > Cc: "Swift User" < swift-user at ci.uchicago.edu >
>         > > > Sent: Monday, May 21, 2012 4:28:02 PM
>         > > > Subject: Re: [Swift-user] Deep recursion on subroutine
>         > > > "main::stageout" at /home/ketan/work/ worker.pl line
>         1349
>         >
>         >
>         > > > Thanks Mike. Indeed the recursion was a warning.
>         > > >
>         > > >
>         > > > I found the problem was that the binary could not find
>         the licence
>         > > > in
>         > > > the cwd from where it was being called. This is an
>         application
>         > > > requirement that the licence file must be present in the
>         cwd from
>         > > > where the call is made.
>         > > >
>         > > >
>         > > > However, Swift makes a dirtree in the workdir, stages
>         the files
>         > > > and
>         > > > calls the binary from *outside* of this tree. Is it
>         possible to
>         > > > make
>         > > > swift stage the licence file and put it on the top level
>         without
>         > > > writing a wrapper to do a cp. Again, the point of not
>         wrapping the
>         > > > binary into a script is to mimic the Hadoop setup as
>         close as
>         > > > possible.
>         > > >
>         > > >
>         > > > On Mon, May 21, 2012 at 3:35 PM, Michael Wilde <
>         wilde at mcs.anl.gov
>         > > > >
>         > > > wrote:
>         > > >
>         > > >
>         > > > Ketan, as far as I can tell, that message, coming from
>         worker.pl ,
>         > > > is
>         > >
>         > > > just a warning.
>         > > >
>         > > > Programming Perl sec 33, Diagnostic Messages: "Deep
>         recursion on
>         > > > subroutine "%s"
>         > > >
>         > > > (W recursion) This subroutine has called itself
>         (directly or
>         > > > indirectly) 100 times more than it has returned. This
>         probably
>         > > > indicates an infinite recursion, unless you're writing
>         strange
>         > > > benchmark programs, in which case it indicates something
>         else."
>         > > >
>         > > > The stageout code in worker.pl is indeed recursive, and
>         the
>         > > > warning
>         > > > could be suppressed:
>         > > >
>         > > > "Try placing
>         > > >
>         > > > no warnings 'recursion';
>         > > >
>         > > > within the same scope as that code ..."
>         > > >
>         > > > Can you try a simple mod to catsn, using your ext
>         mapper, to see
>         > > > if
>         > > > it
>         > > > is indeed failing due to the deeply recursive stageout?
>         > > >
>         > > > If you could dig a bit deeper into this, and see whether
>         its
>         > > > really
>         > > > failing when staging back so many files or failing for
>         some other,
>         > > > or
>         > > > related, reason, that would be great.
>         > > >
>         > > > Thanks,
>         > > >
>         > > > - Mike
>         > > >
>         > > >
>         > > >
>         > > > ----- Original Message -----
>         > > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com
>         >
>         > > > > To: "Swift User" < swift-user at ci.uchicago.edu >
>         > > > > Sent: Monday, May 21, 2012 1:54:34 PM
>         > > > > Subject: [Swift-user] Deep recursion on subroutine
>         > > > > "main::stageout"
>         > >
>         > >
>         > > > > at /home/ketan/work/ worker.pl line 1349
>         > > > > Hi,
>         > > > >
>         > > > >
>         > > > > I am trying to run the GE mars script on a bag of
>         workstations.
>         > > > > I
>         > > > > tested the script for a sufficient number of tasks and
>         seems to
>         > > > > be
>         > > > > working fine on localhost.
>         > > > >
>         > > > >
>         > > > > However, it fails in this setup. I get the error
>         message as
>         > > > > follows
>         > > > > after seemingly right invocation:
>         > > > >
>         > > > >
>         > > > >
>         > > > >
>         > > > > Find: keepalive(120), reconnect -
>         http://128.84.97.46:41287
>         > > > > Progress: time: Mon, 21 May 2012 14:43:18 -0400 Stage
>         in:7
>         > > > > Submitted:3
>         > > > > Progress: time: Mon, 21 May 2012 14:43:19 -0400 Stage
>         in:8
>         > > > > Active:2
>         > > > > Deep recursion on subroutine "main::stageout" at
>         > > > > /home/ketan/work/
>         > > > > worker.pl line 1349.
>         > > > > Deep recursion on subroutine "main::stageout" at
>         > > > > /home/ketan/work/
>         > > > > worker.pl line 1349.
>         > > > > Progress: time: Mon, 21 May 2012 14:43:20 -0400
>         Active:3 Stage
>         > > > > out:7
>         > > > >
>         > > > >
>         > > > > Obviously the staging out of results fails and seems
>         that the
>         > > > > number
>         > > > > of files in the stageout stage is causing the error.
>         The
>         > > > > application
>         > > > > needs to stage out about 120 files.
>         > > > >
>         > > > >
>         > > > > One solution I could quickly think of is to wrap the
>         app in a
>         > > > > shell
>         > > > > and zip the outputs making it just one staged out
>         file.
>         > > > >
>         > > > >
>         > > > > However, the current setup would still be useful since
>         we are
>         > > > > trying
>         > > > > to compare the existing Hadoop solution with the Swift
>         one.
>         > > > >
>         > > > >
>         > > > > Is there any possible workaround, some env setting or
>         so that I
>         > > > > could
>         > > > > try and get the stageout going?
>         > > > >
>         > > > >
>         > > > > The logs are:
>         > > > >
>         http://www.mcs.anl.gov/~ketan/mars-20120521-1443-d6q9lr0a.log
>         > > > > and http://www.mcs.anl.gov/~ketan/workerlogs.tgz
>         > > > >
>         > > > >
>         > > > >
>         > > > >
>         > > > > Regards, --
>         > > > > Ketan
>         > > > >
>         > > > >
>         > > > >
>         > > > > _______________________________________________
>         > > > > Swift-user mailing list
>         > > > > Swift-user at ci.uchicago.edu
>         > > > >
>         https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>         > > >
>         > > > --
>         > > > Michael Wilde
>         > > > Computation Institute, University of Chicago
>         > > > Mathematics and Computer Science Division
>         > > > Argonne National Laboratory
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >
>         > > > --
>         > > > Ketan
>         > >
>         > > --
>         > > Michael Wilde
>         > > Computation Institute, University of Chicago
>         > > Mathematics and Computer Science Division
>         > > Argonne National Laboratory
>         > >
>         > >
>         > >
>         > >
>         > >
>         > > --
>         > > Ketan
>         >
>         > --
>         > Michael Wilde
>         > Computation Institute, University of Chicago
>         > Mathematics and Computer Science Division
>         > Argonne National Laboratory
>         >
>         >
>         >
>         >
>         >
>         > --
>         > Ketan
>         
>         --
>         Michael Wilde
>         Computation Institute, University of Chicago
>         Mathematics and Computer Science Division
>         Argonne National Laboratory
>         
>         
> 
> 
> 
> 
> -- 
> Ketan
> 
> 
> 




