[Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider
Ketan Maheshwari
ketancmaheshwari at gmail.com
Tue Oct 23 14:52:48 CDT 2012
Now trying with CDM. My CDM policy file contains a single line, as follows:
rule .* DEFAULT /
This seems to be working at stage-in, because I immediately see my jobs
starting. However, the run fails immediately afterwards with the message:
"Execution failed:
The following output files were not created by the application:"
followed by a list of outputs. I recall this could happen if absolute
pathnames are not provided, so I updated my mappers.sh scripts to use
absolute pathnames, including a leading double slash (//), without success.
The run log does not show any specific indicators of error other than the
above message.
I see a bunch of CDM_POLICY and CDM_ACTION lines in the wrapper.log in one
of the many jobdirs, as follows:
CDM_POLICY: /home/train07/ketan_mars/swift/result52/mars.ot48 -> DEFAULT /
CDM_ACTION: /home/train07/ketan_mars/swift/swift.workdir/mars-20121023-1240-vbptd8i9/jobs/g/mars-gtln0yzk OUTPUT /home/train07/ketan_mars/swift/result52/mars.ot48 DEFAULT /
Not sure if something could've gone wrong here.
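One way to check which policy CDM applied to each file is to tally the
CDM_POLICY lines across the wrapper.log files. The following is only a
sketch: the line format is inferred from the single example above, and the
names POLICY_RE and cdm_policies are made up for illustration and are not
part of Swift.

```python
import re
from collections import Counter

# Assumed format, inferred from the one wrapper.log line shown above:
#   CDM_POLICY: <file path> -> <POLICY> [<argument>]
POLICY_RE = re.compile(
    r"^CDM_POLICY:\s+(?P<path>\S+)\s+->\s+(?P<policy>\S+)(?:\s+(?P<arg>.*))?$"
)


def cdm_policies(lines):
    """Return (path, policy) pairs for every CDM_POLICY line in `lines`."""
    found = []
    for line in lines:
        match = POLICY_RE.match(line.strip())
        if match:
            found.append((match.group("path"), match.group("policy")))
    return found


# The two log lines quoted in this message, used as sample input.
sample = [
    "CDM_POLICY: /home/train07/ketan_mars/swift/result52/mars.ot48 -> DEFAULT /",
    "CDM_ACTION: /home/train07/ketan_mars/swift/swift.workdir/"
    "mars-20121023-1240-vbptd8i9/jobs/g/mars-gtln0yzk OUTPUT "
    "/home/train07/ketan_mars/swift/result52/mars.ot48 DEFAULT /",
]

pairs = cdm_policies(sample)
counts = Counter(policy for _, policy in pairs)
for policy, n in counts.items():
    print(f"{policy}: {n} file(s)")
```

To run it over a real job directory, pass the lines of each wrapper.log
(for example `open(path).read().splitlines()`) instead of `sample`.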
Attaching the log file and one of the job dirs.
Regards,
Ketan
On Tue, Oct 23, 2012 at 3:02 PM, Ketan Maheshwari <
ketancmaheshwari at gmail.com> wrote:
> Mike,
>
> Thank you for your answers.
>
> I tried catsnsleep with n=100 and s=10 and indeed the number of parallel
> jobs corresponded to the jobthrottle value.
> Surprisingly, when I started the mars application immediately after this,
> it also started 32 jobs in parallel. However, the run failed with a "too
> many open files" error after a while.
>
> Now I am trying the CDM method. Will keep you posted.
>
>
> On Tue, Oct 23, 2012 at 2:36 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:
>
>> Ketan, looking further I see that your app has a large number of output
>> files, O(100). Depending on their size, and the speed of the filesystem on
>> which you are testing, that reinforces my suspicion that the low
>> concurrency you are seeing is due to staging I/O.
>>
>> If this is a local 32-core host, try running with your input and output
>> data and work directory all on a local hard disk (or even /dev/shm if it
>> has sufficient RAM/space). Then try using CDM direct as explained at:
>>
>>
>> http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases
>>
>> - Mike
>>
>> ----- Original Message -----
>> > From: "Michael Wilde" <wilde at mcs.anl.gov>
>> > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
>> > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
>> > Sent: Tuesday, October 23, 2012 12:23:34 PM
>> > Subject: Re: [Swift-devel] jobthrottle value does not correspond to
>> > number of parallel jobs on local provider
>> > Hi Ketan,
>> >
>> > In the log you attached I see this:
>> >
>> > <profile key="jobThrottle" namespace="karajan">0.10</profile>
>> > <profile namespace="karajan" key="initialScore">100000</profile>
>> >
>> > You should leave initialScore constant, and set it to a large number,
>> > no matter what level of manual throttling you want to specify via
>> > sites.xml. We always use 10000 for this value. Don't attempt to vary
>> > the initialScore value for manual throttling: just use jobThrottle to
>> > set what you want.
>> >
>> > A jobThrottle value of 0.10 should run 11 jobs in parallel, that is,
>> > (jobThrottle * 100) + 1 (for historical reasons related to the
>> > automatic throttling algorithm).
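
The rule of thumb above can be turned around to pick a jobThrottle for a
target job count. A minimal sketch, assuming the relation
jobs = (jobThrottle * 100) + 1 holds exactly as stated; the function names
are made up for illustration and are not part of Swift, and rounding to the
nearest integer is an assumption made only to absorb floating-point noise.

```python
def parallel_jobs(job_throttle: float) -> int:
    """Expected number of concurrent jobs for a given jobThrottle value."""
    # Assumption: round to the nearest integer so that values such as
    # 0.07 * 100 (slightly above 7 in floating point) still give 7.
    return round(job_throttle * 100) + 1


def throttle_for(jobs: int) -> float:
    """jobThrottle value that should yield `jobs` concurrent jobs."""
    if jobs < 1:
        raise ValueError("at least one job is required")
    return (jobs - 1) / 100


# Job counts from the experiment described in this thread.
for target in (8, 16, 24, 32):
    throttle = throttle_for(target)
    print(f"{target:2d} jobs -> jobThrottle {throttle:.2f}")
```

For the 8, 16, 24 and 32 job targets this gives 0.07, 0.15, 0.23 and 0.31.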
>> >
>> > If you are seeing fewer than that, one common cause is that the ratio
>> > of your input staging times to your job run times is so high as to
>> > make it impossible for Swift to keep the expected/desired number of
>> > jobs in the active state at once.
>> >
>> > I suggest you test the throttle behavior with a simple app script like
>> > "catsnsleep" (catsn with an artificial sleep to increase job
>> > duration). If your settings (sites + cf) work for that test, then they
>> > should work for the real app, within the staging constraints. Using
>> > CDM "direct" mode is likely what you want here to eliminate
>> > unnecessary staging on a local cluster.
>> >
>> > In your test, what was this ratio? Can you also post your cf file and
>> > the progress log from stdout/stderr?
>> >
>> > - Mike
>> >
>> > ----- Original Message -----
>> > > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
>> > > To: "Swift Devel" <swift-devel at ci.uchicago.edu>
>> > > Sent: Tuesday, October 23, 2012 10:34:25 AM
>> > > Subject: [Swift-devel] jobthrottle value does not correspond to
>> > > number of parallel jobs on local provider
>> > > Hi,
>> > >
>> > >
>> > > I am trying to run an experiment on a 32-core machine with the hope
>> > > of running 8, 16, 24 and 32 jobs in parallel. I am trying to control
>> > > the number of parallel jobs by setting the Karajan jobThrottle
>> > > value in sites.xml to 0.07, 0.15, and so on.
>> > >
>> > >
>> > > However, it seems that the values do not correspond to what I see
>> > > in the Swift progress text.
>> > >
>> > >
>> > > Initially, when I set jobThrottle to 0.07, only 2 jobs started in
>> > > parallel. Then I added a line setting initialScore to 10000, which
>> > > raised the number of parallel jobs to 5. After this, a 10-fold
>> > > increase in initialScore did not raise the job count further.
>> > >
>> > >
>> > > Furthermore, a new batch of 5 jobs gets started only when *all* jobs
>> > > from the old batch are over, as opposed to the continuous supply of
>> > > jobs from the "site selection" to the "stage out" state that happens
>> > > with coaster and other providers.
>> > >
>> > >
>> > > The behavior is the same in Swift 0.93.1 and the latest trunk.
>> > >
>> > >
>> > >
>> > > Thank you for any clues on how to set the expected number of
>> > > parallel jobs to these values.
>> > >
>> > >
>> > > Please find attached one such log of this run.
>> > > Thanks, --
>> > > Ketan
>> > >
>> > >
>> > >
>> > > _______________________________________________
>> > > Swift-devel mailing list
>> > > Swift-devel at ci.uchicago.edu
>> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>> >
>> > --
>> > Michael Wilde
>> > Computation Institute, University of Chicago
>> > Mathematics and Computer Science Division
>> > Argonne National Laboratory
>> >
>
>
> --
> Ketan
>
>
>
--
Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mars-debug.tgz
Type: application/x-gzip
Size: 101159 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20121023/7d90bfaa/attachment.bin>