[Swift-user] Deep recursion on subroutine "main::stageout" at /home/ketan/work/worker.pl line 1349
Jonathan Monette
jonmon at mcs.anl.gov
Tue May 22 17:33:13 CDT 2012
The work dir setting tells Swift where to create its work directory. There should be a jobs dir in one of those directories.
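
One way to check is to search the work dir for any directory named jobs. The following is only a sketch: find_jobs_dirs is a hypothetical helper name, not part of Swift, and the default path is simply the work dir quoted below.

```shell
# Hypothetical helper (not part of Swift): print every directory named
# "jobs" found anywhere below a Swift work dir.
# $1 is the work dir to search; the default is the path quoted in this thread.
find_jobs_dirs() {
    workdir="${1:-/home/ketan/ketan_mars/swift.workdir}"
    find "$workdir" -type d -name jobs
}
```

Calling find_jobs_dirs with no argument searches the default path; if it prints nothing, no jobs dir exists under that work dir.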
On May 22, 2012, at 17:25, Ketan Maheshwari <ketancmaheshwari at gmail.com> wrote:
> I do not see any dir named 'jobs' in my workdir:
>
> following is my workdir and its contents:
> $ pwd
> /home/ketan/ketan_mars/swift.workdir
> $ ls
> total 8.0K
> drwxrwxr-x 5 ketan 4.0K May 22 17:00 mars-20120522-1700-a0a4l957-e-marswrap-e696rork
> drwxrwxr-x 5 ketan 4.0K May 22 17:02 mars-20120522-1702-j6gtml62-k-marswrap-kcj9rork
>
>
> On Tue, May 22, 2012 at 5:27 PM, Jonathan Monette <jonmon at mcs.anl.gov> wrote:
> The work dir and the job dir are two separate things. The work dir is where Swift sets up its work directory. The job dir is where the job is run from. The job dir is in the jobs directory under the work dir. The job dir has symlinks to the data in the shared dir.
>
> On May 22, 2012, at 16:10, Ketan Maheshwari <ketancmaheshwari at gmail.com> wrote:
>
>> Mike,
>>
>> The jobdir and the workdir are the same, right? At least that is what the pwd in my marswrapper shows.
>>
>> The following is the stdout section of swiftwrap:
>> _____________________________________________________________________________
>>
>> stdout
>> _____________________________________________________________________________
>>
>> # pwd
>> /amd/camel/b/ketan/ketan_mars/swift.workdir/mars-20120522-1702-j6gtml62-k-marswrap-kcj9rork
>>
>> # cp -v home/ketan/ketan_mars/MARS-LIC .
>> `home/ketan/ketan_mars/MARS-LIC' -> `./MARS-LIC'
>>
>> # The error message thrown by MARS:
>> <**> ERROR: *** Unable to open License Date File MARS-LIC ***
>> ===================
>>
>> This is why I said Mars is running as if the licence file is not present even though it is present.
>>
>> Also, I do not see any symlinks here in the workdir. They are all real files.
>>
>> On Tue, May 22, 2012 at 1:24 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:
>> If that path home/ketan/ketan_mars/MARS-LIC is being correctly copied to the workdir (and I stand corrected: that's exactly what should happen) then another possibility is that the program doesn't like getting a symlink for the license file? Can you test that case externally (outside of Swift) before we go further?
>>
>> You reported the problem as "...the executable still gets into error as if the licence file is not present."
>>
>> The license file will appear to the MARS executable (and the wrapper script) as a symlink (from the jobdir to the workdir, to use the terminology of the Swift User Guide).
>>
>> If that is indeed the problem, your wrapper script might be able to get around this with:
>> cp MARS-LIC tmplic
>> rm MARS-LIC
>> mv tmplic MARS-LIC
>>
>> Exactly what error is MARS generating for this problem?
>>
>> - Mike
>>
>> ----- Original Message -----
>> > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
>> > To: "Michael Wilde" <wilde at mcs.anl.gov>
>> > Cc: "Swift User" <swift-user at ci.uchicago.edu>
>> > Sent: Tuesday, May 22, 2012 12:01:49 PM
>> > Subject: Re: [Swift-user] Deep recursion on subroutine "main::stageout" at /home/ketan/work/worker.pl line 1349
>> > The line works fine because Swift creates the dir tree starting at
>> > /home but in the swift.workdir. With -v, I could see the file gets
>> > copied to the cwd and is present there.
>> >
>> >
>> > So, I assume that the wrapper script is not cd'ing me anywhere. So, it
>> > is still a mystery why the app complains that the file is not present
>> > when run from the wrapper, yet works when run manually in the same dir.
>> >
>> > On Tue, May 22, 2012 at 11:34 AM, Michael Wilde < wilde at mcs.anl.gov >
>> > wrote:
>> >
>> >
>> > Isn't this line problematic if you don't know where the wrapper script
>> > has cd'ed you to:
>> >
>> > cp -v home/ketan/ketan_mars/MARS-LIC .
>> >       ^^^
>> >
>> > The relative path doesn't seem safe.
>> >
>> >
>> > - Mike
>> >
>> >
>> > ----- Original Message -----
>> > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
>> > > To: "Michael Wilde" < wilde at mcs.anl.gov >
>> > > Cc: "Swift User" < swift-user at ci.uchicago.edu >
>> >
>> >
>> > > Sent: Tuesday, May 22, 2012 10:18:11 AM
>> > > Subject: Re: [Swift-user] Deep recursion on subroutine
>> > > "main::stageout" at /home/ketan/work/worker.pl line 1349
>> > > Looking into this further, I now have a wrapper in place which copies
>> > > the licence file into the cwd before running the executable. However, the
>> > > executable still gets into error as if the licence file is not
>> > > present.
>> > >
>> > >
>> > > When I cd into this dir (swift.workdir/mars-20120519-1203-3l....)
>> > > and
>> > > manually run the executable, it works.
>> > >
>> > >
>> > > So, the question is: does _swiftwrap.staging do some internal
>> > > cd'ing before calling the executable? I will take a look inside, but
>> > > it would be useful if someone knows this.
>> > >
>> > >
>> > > The wrapper script is simply the following two lines:
>> > >
>> > >
>> > > """
>> > > cp -v home/ketan/ketan_mars/MARS-LIC .
>> > > /home/ketan/ketan_mars/marsMain $1
>> > > """
>> > >
>> > >
>> > > Regards,
>> > > Ketan
>> > >
>> > >
>> > > On Mon, May 21, 2012 at 7:51 PM, Michael Wilde < wilde at mcs.anl.gov >
>> > > wrote:
>> > >
>> > >
>> > > I'm surprised that Swift isn't setting the current working dir (cwd)
>> > > to
>> > > be the job dir, but perhaps that's controlled by this property:
>> > >
>> > > # Determines if Swift remote wrappers will be executed by specifying
>> > > an
>> > > # absolute path, or a path relative to the job initial working
>> > > directory
>> > > #
>> > > # valid values: absolute, relative
>> > > # wrapper.invocation.mode=absolute
>> > >
>> > > Can you try your script with this property set to "relative"?
>> > >
>> > > ...but looking at this further: I see that if you're using coasters
>> > > with provider staging, the logic for job launch is quite different.
>> > > We
>> > > need to study this and get back to you. For now, best to force the
>> > > right cd's with a wrapper. You might be able to remove the wrapper
>> > > later, once we resolve how the job dir management should work in
>> > > these
>> > > various cases.
>> > >
>> > >
>> > > - Mike
>> > >
>> > >
>> > > ----- Original Message -----
>> > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
>> > >
>> > > > To: "Michael Wilde" < wilde at mcs.anl.gov >
>> > > > Cc: "Swift User" < swift-user at ci.uchicago.edu >
>> > > > Sent: Monday, May 21, 2012 4:28:02 PM
>> > > > Subject: Re: [Swift-user] Deep recursion on subroutine
>> > > > "main::stageout" at /home/ketan/work/worker.pl line 1349
>> >
>> >
>> > > > Thanks Mike. Indeed the recursion was a warning.
>> > > >
>> > > >
>> > > > I found the problem was that the binary could not find the licence
>> > > > in
>> > > > the cwd from where it was being called. This is an application
>> > > > requirement that the licence file must be present in the cwd from
>> > > > where the call is made.
>> > > >
>> > > >
>> > > > However, Swift makes a dirtree in the workdir, stages the files,
>> > > > and calls the binary from *outside* of this tree. Is it possible to
>> > > > make Swift stage the licence file and put it at the top level without
>> > > > writing a wrapper to do a cp? Again, the point of not wrapping the
>> > > > binary in a script is to mimic the Hadoop setup as closely as
>> > > > possible.
>> > > >
>> > > >
>> > > > On Mon, May 21, 2012 at 3:35 PM, Michael Wilde < wilde at mcs.anl.gov
>> > > > >
>> > > > wrote:
>> > > >
>> > > >
>> > > > Ketan, as far as I can tell, that message, coming from worker.pl, is
>> > > > just a warning.
>> > > >
>> > > > Programming Perl, sec. 33, Diagnostic Messages: "Deep recursion on
>> > > > subroutine "%s"
>> > > >
>> > > > (W recursion) This subroutine has called itself (directly or
>> > > > indirectly) 100 times more than it has returned. This probably
>> > > > indicates an infinite recursion, unless you're writing strange
>> > > > benchmark programs, in which case it indicates something else."
>> > > >
>> > > > The stageout code in worker.pl is indeed recursive, and the
>> > > > warning
>> > > > could be suppressed:
>> > > >
>> > > > "Try placing
>> > > >
>> > > > no warnings 'recursion';
>> > > >
>> > > > within the same scope as that code ..."
>> > > >
>> > > > Can you try a simple mod to catsn, using your ext mapper, to see
>> > > > if it is indeed failing due to the deeply recursive stageout?
>> > > >
>> > > > If you could dig a bit deeper into this, and see whether it's really
>> > > > failing when staging back so many files or failing for some other,
>> > > > or related, reason, that would be great.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > - Mike
>> > > >
>> > > >
>> > > >
>> > > > ----- Original Message -----
>> > > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
>> > > > > To: "Swift User" < swift-user at ci.uchicago.edu >
>> > > > > Sent: Monday, May 21, 2012 1:54:34 PM
>> > > > > Subject: [Swift-user] Deep recursion on subroutine
>> > > > > "main::stageout" at /home/ketan/work/worker.pl line 1349
>> > > > > Hi,
>> > > > >
>> > > > >
>> > > > > I am trying to run the GE mars script on a bag of workstations. I
>> > > > > tested the script for a sufficient number of tasks and it seems to
>> > > > > be working fine on localhost.
>> > > > >
>> > > > >
>> > > > > However, it fails in this setup. I get the following error message
>> > > > > after a seemingly correct invocation:
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > Find: keepalive(120), reconnect - http://128.84.97.46:41287
>> > > > > Progress: time: Mon, 21 May 2012 14:43:18 -0400 Stage in:7 Submitted:3
>> > > > > Progress: time: Mon, 21 May 2012 14:43:19 -0400 Stage in:8 Active:2
>> > > > > Deep recursion on subroutine "main::stageout" at /home/ketan/work/worker.pl line 1349.
>> > > > > Deep recursion on subroutine "main::stageout" at /home/ketan/work/worker.pl line 1349.
>> > > > > Progress: time: Mon, 21 May 2012 14:43:20 -0400 Active:3 Stage out:7
>> > > > >
>> > > > >
>> > > > > Obviously the staging out of results fails, and it seems that the
>> > > > > number of files in the stageout stage is causing the error. The
>> > > > > application needs to stage out about 120 files.
>> > > > >
>> > > > >
>> > > > > One solution I could quickly think of is to wrap the app in a
>> > > > > shell
>> > > > > and zip the outputs making it just one staged out file.
>> > > > >
>> > > > >
>> > > > > However, the current setup would still be useful since we are
>> > > > > trying
>> > > > > to compare the existing Hadoop solution with the Swift one.
>> > > > >
>> > > > >
>> > > > > Is there any possible workaround, some env setting or so that I
>> > > > > could
>> > > > > try and get the stageout going?
>> > > > >
>> > > > >
>> > > > > The logs are:
>> > > > > http://www.mcs.anl.gov/~ketan/mars-20120521-1443-d6q9lr0a.log
>> > > > > and http://www.mcs.anl.gov/~ketan/workerlogs.tgz
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > Regards, --
>> > > > > Ketan
>> > > > >
>> > > > >
>> > > > >
>> > > > > _______________________________________________
>> > > > > Swift-user mailing list
>> > > > > Swift-user at ci.uchicago.edu
>> > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>> > > >
>> > > > --
>> > > > Michael Wilde
>> > > > Computation Institute, University of Chicago
>> > > > Mathematics and Computer Science Division
>> > > > Argonne National Laboratory
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Ketan
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Ketan
>> >
>> >
>> >
>> >
>> >
>> >
>> > --
>> > Ketan
>>
>>
>>
>>
>>
>> --
>> Ketan
>>
>>
>
>
>
> --
> Ketan
>
>