[Swift-user] Deep recursion on subroutine "main::stageout" at /home/ketan/work/worker.pl line 1349

Ketan Maheshwari ketancmaheshwari at gmail.com
Tue May 22 17:25:48 CDT 2012


I do not see any directory named 'jobs' in my workdir.

The following is my workdir and its contents:
$ pwd
/home/ketan/ketan_mars/swift.workdir
$ ls
total 8.0K
drwxrwxr-x  5 ketan 4.0K May 22 17:00 mars-20120522-1700-a0a4l957-e-marswrap-e696rork
drwxrwxr-x  5 ketan 4.0K May 22 17:02 mars-20120522-1702-j6gtml62-k-marswrap-kcj9rork
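
A quick way to check a run directory for a 'jobs' subdirectory and for symlinks at any depth could look like the sketch below. The function name inspect_rundir is made up for illustration and is not part of Swift; it only assumes a POSIX shell with find, wc and tr available.

```shell
#!/bin/sh
# Hypothetical helper (not part of Swift): summarize a run directory.
# Usage: inspect_rundir DIR
inspect_rundir() {
    dir=$1

    # Report whether a directory named "jobs" exists anywhere under DIR.
    if [ -n "$(find "$dir" -type d -name jobs 2>/dev/null)" ]; then
        echo "jobs dir: yes"
    else
        echo "jobs dir: no"
    fi

    # Count symlinks and regular files separately, so it is clear whether
    # staged inputs are links into a shared dir or real copies.
    links=$(find "$dir" -type l 2>/dev/null | wc -l | tr -d ' ')
    files=$(find "$dir" -type f 2>/dev/null | wc -l | tr -d ' ')
    echo "symlinks: $links"
    echo "regular files: $files"
}
```

Calling inspect_rundir on one of the mars-... directories listed above prints three lines: whether a jobs dir was found, the number of symlinks, and the number of regular files.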


On Tue, May 22, 2012 at 5:27 PM, Jonathan Monette <jonmon at mcs.anl.gov> wrote:

> The work dir and job dir are two separate things. The work dir is where
> Swift sets up the work directory. The job dir is where the job is run
> from. The job dir is in the jobs directory under the work dir. The job dir
> has symlinks to the data in the shared dir.
>
> On May 22, 2012, at 16:10, Ketan Maheshwari <ketancmaheshwari at gmail.com>
> wrote:
>
> Mike,
>
> The jobdir and the workdir are the same, right? At least that is what the
> pwd in my marswrapper shows.
>
> The following is the stdout section of swiftwrap:
>
> _____________________________________________________________________________
>
>         stdout
>
> _____________________________________________________________________________
>
> # pwd
>
> /amd/camel/b/ketan/ketan_mars/swift.workdir/mars-20120522-1702-j6gtml62-k-marswrap-kcj9rork
>
> # cp -v home/ketan/ketan_mars/MARS-LIC .
> `home/ketan/ketan_mars/MARS-LIC' -> `./MARS-LIC'
>
> # The error message thrown by mars:
>  <**> ERROR: *** Unable to open License Date File MARS-LIC ***
> ===================
>
> This is why I said Mars is running as if the licence file is not present
> even though it is present.
>
> Also, I do not see any symlinks here in the workdir. They are all real
> files.
>
> On Tue, May 22, 2012 at 1:24 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:
>
>> If that path home/ketan/ketan_mars/MARS-LIC is being correctly copied to
>> the workdir (and I stand corrected: that's exactly what should happen) then
>> another possibility is that the program doesn't like getting a symlink for
>> the license file? Can you test that case externally (outside of Swift)
>> before we go further?
>>
>> You reported the problem as "...the executable still gets into error as
>> if the licence file is not present."
>>
>> The license file will appear to the MARS executable (and the wrapper
>> script) as a symlink (from the jobdir to the workdir, to use the
>> terminology of the Swift User Guide).
>>
>> If that is indeed the problem, your wrapper script might be able to get
>> around this with:
>>  cp MARS-LIC tmplic
>>  rm MARS-LIC
>>  mv tmplic MARS-LIC
>>
>> Exactly what error is MARS generating for this problem?
>>
>> - Mike
>>
>> ----- Original Message -----
>> > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
>> > To: "Michael Wilde" <wilde at mcs.anl.gov>
>> > Cc: "Swift User" <swift-user at ci.uchicago.edu>
>> > Sent: Tuesday, May 22, 2012 12:01:49 PM
>> > Subject: Re: [Swift-user] Deep recursion on subroutine "main::stageout"
>> at /home/ketan/work/worker.pl line 1349
>> > The line works fine because Swift creates the dir tree starting at
>> > /home but in the swift.workdir. With -v, I could see the file gets
>> > copied to the cwd and is present there.
>> >
>> >
>> > So, I assume that the wrapper script is not cd'ing me anywhere. So, it
>> > is still a mystery why the app complains that the file is not present
>> > when run from the wrapper, yet works when run manually in the same dir.
>> >
>> > On Tue, May 22, 2012 at 11:34 AM, Michael Wilde < wilde at mcs.anl.gov >
>> > wrote:
>> >
>> >
>> > Isn't this line problematic if you don't know where the wrapper script
>> > has you cd'ed to:
>> >
>> > cp -v home/ketan/ketan_mars/MARS-LIC .
>> > ^^^
>> >
>> > The relative path doesn't seem safe.
>> >
>> >
>> > - Mike
>> >
>> >
>> > ----- Original Message -----
>> > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
>> > > To: "Michael Wilde" < wilde at mcs.anl.gov >
>> > > Cc: "Swift User" < swift-user at ci.uchicago.edu >
>> >
>> >
>> > > Sent: Tuesday, May 22, 2012 10:18:11 AM
>> > > Subject: Re: [Swift-user] Deep recursion on subroutine
>> > > "main::stageout" at /home/ketan/work/worker.pl line 1349
>> > > Looking into this further, I now have a wrapper in place which copies the
>> > > licence file into the cwd before running the executable. However, the
>> > > executable still errors out as if the licence file is not
>> > > present.
>> > >
>> > >
>> > > When I cd into this dir (swift.workdir/mars-20120519-1203-3l....) and
>> > > manually run the executable, it works.
>> > >
>> > >
>> > > So, the question is: does _swiftwrap.staging do some internal
>> > > cd'ing before calling the executable? I will take a look inside, but
>> > > it would be useful if someone knows this.
>> > >
>> > >
>> > > The wrapper script is simply the following two lines:
>> > >
>> > >
>> > > """
>> > > cp -v home/ketan/ketan_mars/MARS-LIC .
>> > > /home/ketan/ketan_mars/marsMain $1
>> > > """
>> > >
>> > >
>> > > Regards,
>> > > Ketan
>> > >
>> > >
>> > > On Mon, May 21, 2012 at 7:51 PM, Michael Wilde < wilde at mcs.anl.gov >
>> > > wrote:
>> > >
>> > >
>> > > I'm surprised that Swift isn't setting the current working dir (cwd) to
>> > > be the job dir, but perhaps that's controlled by this property:
>> > >
>> > > # Determines if Swift remote wrappers will be executed by specifying an
>> > > # absolute path, or a path relative to the job initial working directory
>> > > #
>> > > # valid values: absolute, relative
>> > > # wrapper.invocation.mode=absolute
>> > >
>> > > Can you try your script with this property set to "relative"?
>> > >
>> > > ...but looking at this further: I see that if you're using coasters
>> > > with provider staging, the logic for job launch is quite different. We
>> > > need to study this and get back to you. For now, best to force the
>> > > right cd's with a wrapper. You might be able to remove the wrapper
>> > > later, once we resolve how the job dir management should work in these
>> > > various cases.
>> > >
>> > >
>> > > - Mike
>> > >
>> > >
>> > > ----- Original Message -----
>> > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
>> > >
>> > > > To: "Michael Wilde" < wilde at mcs.anl.gov >
>> > > > Cc: "Swift User" < swift-user at ci.uchicago.edu >
>> > > > Sent: Monday, May 21, 2012 4:28:02 PM
>> > > > Subject: Re: [Swift-user] Deep recursion on subroutine
>> > > > "main::stageout" at /home/ketan/work/worker.pl line 1349
>> >
>> >
>> > > > Thanks Mike. Indeed the recursion was a warning.
>> > > >
>> > > >
>> > > > I found the problem was that the binary could not find the licence in
>> > > > the cwd from where it was being called. This is an application
>> > > > requirement that the licence file must be present in the cwd from
>> > > > where the call is made.
>> > > >
>> > > >
>> > > > However, Swift makes a dirtree in the workdir, stages the files and
>> > > > calls the binary from *outside* of this tree. Is it possible to make
>> > > > Swift stage the licence file and put it at the top level without
>> > > > writing a wrapper to do a cp? Again, the point of not wrapping the
>> > > > binary in a script is to mimic the Hadoop setup as closely as
>> > > > possible.
>> > > >
>> > > >
>> > > > On Mon, May 21, 2012 at 3:35 PM, Michael Wilde < wilde at mcs.anl.gov
>> > > > >
>> > > > wrote:
>> > > >
>> > > >
>> > > > Ketan, as far as I can tell, that message, coming from worker.pl, is
>> > > > just a warning.
>> > > >
>> > > > Programming Perl, sec. 33, Diagnostic Messages: "Deep recursion on
>> > > > subroutine "%s"
>> > > >
>> > > > (W recursion) This subroutine has called itself (directly or
>> > > > indirectly) 100 times more than it has returned. This probably
>> > > > indicates an infinite recursion, unless you're writing strange
>> > > > benchmark programs, in which case it indicates something else."
>> > > >
>> > > > The stageout code in worker.pl is indeed recursive, and the
>> > > > warning
>> > > > could be suppressed:
>> > > >
>> > > > "Try placing
>> > > >
>> > > > no warnings 'recursion';
>> > > >
>> > > > within the same scope as that code ..."
>> > > >
>> > > > Can you try a simple mod to catsn, using your ext mapper, to see if it
>> > > > is indeed failing due to the deeply recursive stageout?
>> > > >
>> > > > If you could dig a bit deeper into this, and see whether it's really
>> > > > failing when staging back so many files or failing for some other, or
>> > > > related, reason, that would be great.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > - Mike
>> > > >
>> > > >
>> > > >
>> > > > ----- Original Message -----
>> > > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
>> > > > > To: "Swift User" < swift-user at ci.uchicago.edu >
>> > > > > Sent: Monday, May 21, 2012 1:54:34 PM
>> > > > > Subject: [Swift-user] Deep recursion on subroutine "main::stageout"
>> > > > > at /home/ketan/work/worker.pl line 1349
>> > > > > Hi,
>> > > > >
>> > > > >
>> > > > > I am trying to run the GE mars script on a bag of workstations. I
>> > > > > tested the script for a sufficient number of tasks and it seems to
>> > > > > work fine on localhost.
>> > > > >
>> > > > >
>> > > > > However, it fails in this setup. I get the following error message
>> > > > > after a seemingly correct invocation:
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > Find: keepalive(120), reconnect - http://128.84.97.46:41287
>> > > > > Progress: time: Mon, 21 May 2012 14:43:18 -0400 Stage in:7 Submitted:3
>> > > > > Progress: time: Mon, 21 May 2012 14:43:19 -0400 Stage in:8 Active:2
>> > > > > Deep recursion on subroutine "main::stageout" at /home/ketan/work/worker.pl line 1349.
>> > > > > Deep recursion on subroutine "main::stageout" at /home/ketan/work/worker.pl line 1349.
>> > > > > Progress: time: Mon, 21 May 2012 14:43:20 -0400 Active:3 Stage out:7
>> > > > >
>> > > > >
>> > > > > Obviously the staging out of results fails, and it seems that the
>> > > > > number of files in the stageout stage is causing the error. The
>> > > > > application needs to stage out about 120 files.
>> > > > >
>> > > > >
>> > > > > One solution I could quickly think of is to wrap the app in a shell
>> > > > > script and zip the outputs, making it just one staged-out file.
>> > > > >
>> > > > >
>> > > > > However, the current setup would still be useful since we are
>> > > > > trying
>> > > > > to compare the existing Hadoop solution with the Swift one.
>> > > > >
>> > > > >
>> > > > > Is there any possible workaround, some env setting or so, that I
>> > > > > could try to get the stageout going?
>> > > > >
>> > > > >
>> > > > > The logs are:
>> > > > > http://www.mcs.anl.gov/~ketan/mars-20120521-1443-d6q9lr0a.log
>> > > > > and http://www.mcs.anl.gov/~ketan/workerlogs.tgz
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > Regards, --
>> > > > > Ketan
>> > > > >
>> > > > >
>> > > > >
>> > > > > _______________________________________________
>> > > > > Swift-user mailing list
>> > > > > Swift-user at ci.uchicago.edu
>> > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>> > > >
>> > > > --
>> > > > Michael Wilde
>> > > > Computation Institute, University of Chicago
>> > > > Mathematics and Computer Science Division
>> > > > Argonne National Laboratory
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Ketan
>> > >
>> > > --
>> > > Michael Wilde
>> > > Computation Institute, University of Chicago
>> > > Mathematics and Computer Science Division
>> > > Argonne National Laboratory
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Ketan
>> >
>> > --
>> > Michael Wilde
>> > Computation Institute, University of Chicago
>> > Mathematics and Computer Science Division
>> > Argonne National Laboratory
>> >
>> >
>> >
>> >
>> >
>> > --
>> > Ketan
>>
>> --
>> Michael Wilde
>> Computation Institute, University of Chicago
>> Mathematics and Computer Science Division
>> Argonne National Laboratory
>>
>>
>
>
> --
> Ketan
>
>
>
>


-- 
Ketan