[Swift-user] Deep recursion on subroutine "main::stageout" at /home/ketan/work/worker.pl line 1349

Ketan Maheshwari ketancmaheshwari at gmail.com
Tue May 22 19:54:22 CDT 2012


Mihael,

As far as I know, no environment setup is required before running Mars.

With ls -l and pwd added to the wrapper script, I do see the licence file,
and pwd shows the expected directory:

#pwd
/nfs2/ketan/ketan_mars/swift.workdir/mars-20120522-1933-44hycbr1-e-marswrap-et4o0prk

#cp -v
`home/ketan/ketan_mars/MARS-LIC' -> `./MARS-LIC'

#ls -l
total 4
-rw-r--r-- 1 ketan collab    0 2012-05-22 19:33 3
drwxr-xr-x 3 ketan collab    3 2012-05-22 19:33 home
-rw-r--r-- 1 ketan collab   75 2012-05-22 19:33 MARS-LIC
drwxr-xr-x 2 ketan collab    3 2012-05-22 19:33 outs
drwxr-xr-x 2 ketan collab    2 2012-05-22 19:33 result0
-rw-r--r-- 1 ketan collab    0 2012-05-22 19:33 stderr.txt
-rw-r--r-- 1 ketan collab 6070 2012-05-22 19:33 _swiftwrap.staging
-rw-r--r-- 1 ketan collab 5729 2012-05-22 19:33 wrapper.log


#still the error message
 <**> ERROR: *** Unable to open License Date File MARS-LIC ***


When I run Mars manually from the same dir it works:
[steamroller:mars-20120522-1933-44hycbr1-e-marswrap-et4o0prk]$ /home/ketan/ketan_mars/marsMain home/ketan/ketan_mars/ctlfiles/mars.ctl.0
# normal output
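
Since the manual run works, one more check that might narrow this down is
to have the wrapper actually try to open the licence file rather than only
list it. What follows is a minimal sketch, not something I have run under
Swift; the function name check_readable is made up for illustration and is
not part of Swift or Mars.

```shell
#!/bin/sh
# Hypothetical helper (not part of Swift or Mars): report whether a file
# in the current directory exists and can really be opened for reading.
check_readable() {
    f="$1"
    if [ ! -e "$f" ]; then
        echo "$f: missing in $(pwd)"
        return 1
    fi
    if [ ! -r "$f" ]; then
        echo "$f: present but not readable"
        return 2
    fi
    # Open the file for real, as the application would have to do.
    if ! cat "$f" > /dev/null 2>&1; then
        echo "$f: open failed"
        return 3
    fi
    echo "$f: readable, $(wc -c < "$f" | tr -d ' ') bytes"
    return 0
}
```

Calling "check_readable MARS-LIC" in the wrapper right before the marsMain
line would show in the job stdout whether the file can be opened from
inside the Swift-launched process.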

This time I tried the same setup on the MCS cluster, and the result is the
same as on the Cornell one.
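
On the question of whether the environment differs between a Swift run and
a manual run, a rough way to compare the two is sketched below. It assumes
nothing about Mars itself; snapshot_env and the output file names are
hypothetical names chosen for this example.

```shell
#!/bin/sh
# Hypothetical helper: write the working directory, the umask and the
# sorted environment of the current process to the file given as $1.
snapshot_env() {
    out="$1"
    {
        echo "cwd=$(pwd)"
        echo "umask=$(umask)"
        env | sort
    } > "$out"
}
```

The wrapper could call "snapshot_env env.swift.txt" before marsMain, and
the same function could be run by hand from the job directory as
"snapshot_env env.manual.txt". Running diff on the two files would then
list any variable that is set in one case and not in the other.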

On Tue, May 22, 2012 at 7:07 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:

> And, Ketan: can you put an ls -l and pwd in your wrapper script, to get
> some more diagnostic info?
>
> ----- Original Message -----
> > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > Cc: "Michael Wilde" <wilde at mcs.anl.gov>, "Swift User" <
> swift-user at ci.uchicago.edu>
> > Sent: Tuesday, May 22, 2012 5:42:16 PM
> > Subject: Re: [Swift-user] Deep recursion on subroutine "main::stageout"
> at /home/ketan/work/worker.pl line 1349
> > With provider staging the directory where stuff gets run in (i.e. job
> > CWD) is set through the submission protocol. In other words,
> > _swiftwrap.staging gets run there.
> >
> > _swiftwrap.staging does not change directories.
> >
> > The environment might be different between a swift run and a manual
> > login and run. Is there maybe something in the environment that your
> > app
> > uses to look up the license file?
> >
> > Mihael
> >
> > On Tue, 2012-05-22 at 17:10 -0400, Ketan Maheshwari wrote:
> > > Mike,
> > >
> > >
> > > The jobdir and the workdir are the same right? At least that is what
> > > the pwd in my marswrapper shows.
> > >
> > >
> > > The following is the stdout section of swiftwrap:
> > >
> _____________________________________________________________________________
> > >
> > >
> > >         stdout
> > >
> _____________________________________________________________________________
> > >
> > >
> > > # pwd
> > >
> /amd/camel/b/ketan/ketan_mars/swift.workdir/mars-20120522-1702-j6gtml62-k-marswrap-kcj9rork
> > >
> > >
> > > # cp -v home/ketan/ketan_mars/MARS-LIC .
> > > `home/ketan/ketan_mars/MARS-LIC' -> `./MARS-LIC'
> > >
> > >
> > > # The error message thrown by mars
> > >  <**> ERROR: *** Unable to open License Date File MARS-LIC ***
> > > ===================
> > >
> > >
> > > This is why I said Mars is running as if the licence file is not
> > > present even though it is present.
> > >
> > >
> > > Also, I do not see any symlinks here in the workdir. They are all
> > > real
> > > files.
> > >
> > > On Tue, May 22, 2012 at 1:24 PM, Michael Wilde <wilde at mcs.anl.gov>
> > > wrote:
> > >         If that path home/ketan/ketan_mars/MARS-LIC is being correctly
> > >         copied to the workdir (and I stand corrected: that's exactly
> > >         what should happen) then another possibility is that the
> > >         program doesn't like getting a symlink for the license file?
> > >         Can you test that case externally (outside of Swift) before
> > >         we go further?
> > >
> > >         You reported the problem as "...the executable still gets
> > >         into error as if the licence file is not present."
> > >
> > >         The license file will appear to the MARS executable (and the
> > >         wrapper script) as a symlink (from the jobdir to the
> > >         workdir, to use the terminology of the Swift User Guide).
> > >
> > >         If that is indeed the problem, your wrapper script might be
> > >         able to get around this with:
> > >          cp MARS-LIC tmplic
> > >          rm MARS-LIC
> > >          mv tmplic MARS-LIC
> > >
> > >         Exactly what error is MARS generating for this problem?
> > >
> > >         - Mike
> > >
> > >         ----- Original Message -----
> > >         > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > >         > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > >         > Cc: "Swift User" <swift-user at ci.uchicago.edu>
> > >
> > >         > Sent: Tuesday, May 22, 2012 12:01:49 PM
> > >         > Subject: Re: [Swift-user] Deep recursion on subroutine
> > >         "main::stageout" at /home/ketan/work/worker.pl line 1349
> > >
> > >         > The line works fine because Swift creates the dir tree
> > >         > starting at /home but in the swift.workdir. With -v, I could
> > >         > see the file gets copied to the cwd and is present there.
> > >         >
> > >         >
> > >         > So, I assume that the wrapper script is not cd'ing me
> > >         > anywhere. So, it is still a mystery why the app complains
> > >         > about the file not being present when run from the wrapper,
> > >         > while it works when run manually in the same dir.
> > >         >
> > >         > On Tue, May 22, 2012 at 11:34 AM, Michael Wilde <
> > >         wilde at mcs.anl.gov >
> > >         > wrote:
> > >         >
> > >         >
> > >         > Isn't this line problematic if you don't know where the
> > >         > wrapper script has you cd'ed to:
> > >         >
> > >         > cp -v home/ketan/ketan_mars/MARS-LIC .
> > >         > ^^^
> > >         >
> > >         > The relative path doesn't seem safe.
> > >         >
> > >         >
> > >         > - Mike
> > >         >
> > >         >
> > >         > ----- Original Message -----
> > >         > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > >         > > To: "Michael Wilde" < wilde at mcs.anl.gov >
> > >         > > Cc: "Swift User" < swift-user at ci.uchicago.edu >
> > >         >
> > >         >
> > >         > > Sent: Tuesday, May 22, 2012 10:18:11 AM
> > >         > > Subject: Re: [Swift-user] Deep recursion on subroutine
> > >
> > >         > > "main::stageout" at /home/ketan/work/ worker.pl line
> > >         > > 1349
> > >         > > Looking into this further, I now have a wrapper in place
> > >         > > which copies the licence file into the cwd before running
> > >         > > the executable. However, the executable still fails as if
> > >         > > the licence file is not present.
> > >         > >
> > >         > >
> > >         > > When I cd into this dir
> > >         > > (swift.workdir/mars-20120519-1203-3l....) and manually run
> > >         > > the executable, it works.
> > >         > >
> > >         > >
> > >         > > So, the question is: does _swiftwrap.staging do some
> > >         > > internal cd'ing before calling the executable? I will take a
> > >         > > look inside, but it would be useful if someone knows this.
> > >         > >
> > >         > >
> > >         > > The wrapper script is simply the following two lines:
> > >         > >
> > >         > >
> > >         > > """
> > >         > > cp -v home/ketan/ketan_mars/MARS-LIC .
> > >         > > /home/ketan/ketan_mars/marsMain $1
> > >         > > """
> > >         > >
> > >         > >
> > >         > > Regards,
> > >         > > Ketan
> > >         > >
> > >         > >
> > >         > > On Mon, May 21, 2012 at 7:51 PM, Michael Wilde <
> > >         wilde at mcs.anl.gov >
> > >         > > wrote:
> > >         > >
> > >         > >
> > >         > > I'm surprised that Swift isn't setting the current working
> > >         > > dir (cwd) to be the job dir, but perhaps that's controlled
> > >         > > by this property:
> > >         > >
> > >         > > # Determines if Swift remote wrappers will be executed by
> > >         > > # specifying an absolute path, or a path relative to the job
> > >         > > # initial working directory
> > >         > > #
> > >         > > # valid values: absolute, relative
> > >         > > # wrapper.invocation.mode=absolute
> > >         > >
> > >         > > Can you try your script with this property set to "relative"?
> > >         > >
> > >         > > ...but looking at this further: I see that if you're using
> > >         > > coasters with provider staging, the logic for job launch is
> > >         > > quite different. We need to study this and get back to you.
> > >         > > For now, best to force the right cd's with a wrapper. You
> > >         > > might be able to remove the wrapper later, once we resolve
> > >         > > how the job dir management should work in these various
> > >         > > cases.
> > >         > >
> > >         > >
> > >         > > - Mike
> > >         > >
> > >         > >
> > >         > > ----- Original Message -----
> > >         > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com
> > >         > > > >
> > >         > >
> > >         > > > To: "Michael Wilde" < wilde at mcs.anl.gov >
> > >         > > > Cc: "Swift User" < swift-user at ci.uchicago.edu >
> > >         > > > Sent: Monday, May 21, 2012 4:28:02 PM
> > >         > > > Subject: Re: [Swift-user] Deep recursion on subroutine
> > >         > > > "main::stageout" at /home/ketan/work/ worker.pl line
> > >         1349
> > >         >
> > >         >
> > >         > > > Thanks Mike. Indeed the recursion was a warning.
> > >         > > >
> > >         > > >
> > >         > > > I found the problem was that the binary could not find the
> > >         > > > licence in the cwd from where it was being called. It is an
> > >         > > > application requirement that the licence file be present
> > >         > > > in the cwd from where the call is made.
> > >         > > >
> > >         > > >
> > >         > > > However, Swift makes a dirtree in the workdir, stages the
> > >         > > > files and calls the binary from *outside* of this tree. Is
> > >         > > > it possible to make swift stage the licence file and put it
> > >         > > > on the top level without writing a wrapper to do a cp?
> > >         > > > Again, the point of not wrapping the binary into a script
> > >         > > > is to mimic the Hadoop setup as closely as possible.
> > >         > > >
> > >         > > >
> > >         > > > On Mon, May 21, 2012 at 3:35 PM, Michael Wilde <
> > >         wilde at mcs.anl.gov
> > >         > > > >
> > >         > > > wrote:
> > >         > > >
> > >         > > >
> > >         > > > Ketan, as far as I can tell, that message, coming from
> > >         > > > worker.pl, is just a warning.
> > >         > > >
> > >         > > > Programming Perl sec 33, Diagnostic Messages: "Deep
> > >         > > > recursion on subroutine "%s"
> > >         > > >
> > >         > > > (W recursion) This subroutine has called itself (directly
> > >         > > > or indirectly) 100 times more than it has returned. This
> > >         > > > probably indicates an infinite recursion, unless you're
> > >         > > > writing strange benchmark programs, in which case it
> > >         > > > indicates something else."
> > >         > > >
> > >         > > > The stageout code in worker.pl is indeed recursive, and the
> > >         > > > warning could be suppressed:
> > >         > > >
> > >         > > > "Try placing
> > >         > > >
> > >         > > > no warnings 'recursion';
> > >         > > >
> > >         > > > within the same scope as that code ..."
> > >         > > >
> > >         > > > Can you try a simple mod to catsn, using your ext mapper, to
> > >         > > > see if it is indeed failing due to the deeply recursive
> > >         > > > stageout?
> > >         > > >
> > >         > > > If you could dig a bit deeper into this, and see whether
> > >         > > > it's really failing when staging back so many files or
> > >         > > > failing for some other, or related, reason, that would be
> > >         > > > great.
> > >         > > >
> > >         > > > Thanks,
> > >         > > >
> > >         > > > - Mike
> > >         > > >
> > >         > > >
> > >         > > >
> > >         > > > ----- Original Message -----
> > >         > > > > From: "Ketan Maheshwari" <
> > >         > > > > ketancmaheshwari at gmail.com
> > >         >
> > >         > > > > To: "Swift User" < swift-user at ci.uchicago.edu >
> > >         > > > > Sent: Monday, May 21, 2012 1:54:34 PM
> > >         > > > > Subject: [Swift-user] Deep recursion on subroutine
> > >         > > > > "main::stageout"
> > >         > >
> > >         > >
> > >         > > > > at /home/ketan/work/ worker.pl line 1349
> > >         > > > > Hi,
> > >         > > > >
> > >         > > > >
> > >         > > > > I am trying to run the GE mars script on a bag of
> > >         > > > > workstations. I tested the script for a sufficient number
> > >         > > > > of tasks and it seems to be working fine on localhost.
> > >         > > > >
> > >         > > > >
> > >         > > > > However, it fails in this setup. I get the following error
> > >         > > > > message after a seemingly correct invocation:
> > >         > > > >
> > >         > > > >
> > >         > > > >
> > >         > > > >
> > >         > > > > Find: keepalive(120), reconnect - http://128.84.97.46:41287
> > >         > > > > Progress: time: Mon, 21 May 2012 14:43:18 -0400 Stage in:7 Submitted:3
> > >         > > > > Progress: time: Mon, 21 May 2012 14:43:19 -0400 Stage in:8 Active:2
> > >         > > > > Deep recursion on subroutine "main::stageout" at /home/ketan/work/worker.pl line 1349.
> > >         > > > > Deep recursion on subroutine "main::stageout" at /home/ketan/work/worker.pl line 1349.
> > >         > > > > Progress: time: Mon, 21 May 2012 14:43:20 -0400 Active:3 Stage out:7
> > >         > > > >
> > >         > > > >
> > >         > > > > Obviously the staging out of results fails, and it seems
> > >         > > > > that the number of files in the stageout stage is causing
> > >         > > > > the error. The application needs to stage out about 120
> > >         > > > > files.
> > >         > > > >
> > >         > > > >
> > >         > > > > One solution I could quickly think of is to wrap the app in
> > >         > > > > a shell script and zip the outputs, making it just one
> > >         > > > > staged-out file.
> > >         > > > >
> > >         > > > >
> > >         > > > > However, the current setup would still be useful since we
> > >         > > > > are trying to compare the existing Hadoop solution with the
> > >         > > > > Swift one.
> > >         > > > >
> > >         > > > >
> > >         > > > > Is there any possible workaround, some env setting or so,
> > >         > > > > that I could try to get the stageout going?
> > >         > > > >
> > >         > > > >
> > >         > > > > The logs are:
> > >         > > > >
> > >         http://www.mcs.anl.gov/~ketan/mars-20120521-1443-d6q9lr0a.log
> > >         > > > > and http://www.mcs.anl.gov/~ketan/workerlogs.tgz
> > >         > > > >
> > >         > > > >
> > >         > > > >
> > >         > > > >
> > >         > > > > Regards, --
> > >         > > > > Ketan
> > >         > > > >
> > >         > > > >
> > >         > > > >
> > >         > > > > _______________________________________________
> > >         > > > > Swift-user mailing list
> > >         > > > > Swift-user at ci.uchicago.edu
> > >         > > > >
> > >
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > >         > > >
> > >         > > > --
> > >         > > > Michael Wilde
> > >         > > > Computation Institute, University of Chicago
> > >         > > > Mathematics and Computer Science Division
> > >         > > > Argonne National Laboratory
> > >         > > >
> > >         > > >
> > >         > > >
> > >         > > >
> > >         > > >
> > >         > > > --
> > >         > > > Ketan
> > >         > >
> > >         > > --
> > >         > > Michael Wilde
> > >         > > Computation Institute, University of Chicago
> > >         > > Mathematics and Computer Science Division
> > >         > > Argonne National Laboratory
> > >         > >
> > >         > >
> > >         > >
> > >         > >
> > >         > >
> > >         > > --
> > >         > > Ketan
> > >         >
> > >         > --
> > >         > Michael Wilde
> > >         > Computation Institute, University of Chicago
> > >         > Mathematics and Computer Science Division
> > >         > Argonne National Laboratory
> > >         >
> > >         >
> > >         >
> > >         >
> > >         >
> > >         > --
> > >         > Ketan
> > >
> > >         --
> > >         Michael Wilde
> > >         Computation Institute, University of Chicago
> > >         Mathematics and Computer Science Division
> > >         Argonne National Laboratory
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Ketan
> > >
> > >
> > >
> > > _______________________________________________
> > > Swift-user mailing list
> > > Swift-user at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
>


-- 
Ketan