[Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider

Ketan Maheshwari ketancmaheshwari at gmail.com
Wed Oct 24 08:25:47 CDT 2012


Hi Mike,

It seems the problem is resolved now. There were multiple issues:

In my config file, use.provider.staging was set to true, while in the sites
file the staging method was set to file. This conflicted with cdm link
creation because a file with the link's name was already present. I resolved
it by setting use.provider.staging to false and removing the staging method
line from sites.xml.
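For reference, the change was roughly the following (exact property spelling
is from memory, so treat it as an approximation):

```
# cf / swift.properties: turn provider staging off
use.provider.staging=false

# sites.xml: the staging-method profile line that was removed, i.e.
#   <profile namespace="swift" key="stagingMethod">file</profile>
```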

It turns out that Mars only works when the licence file is present in the
same dir as the data; for some reason it does not accept a symlinked licence
file. So the licence file had to be excluded from CDM. I now use individual
patterns to cdm the inputs.
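For the record, the CDM policy file now looks roughly like this (file names
and paths are hypothetical placeholders; rule syntax as I understand it from
the user guide):

```
# DIRECT-map the input data (no staging/copying)...
rule .*\.in DIRECT /scratch/ketan/marsdata
# ...but let the licence file go through the default staging path,
# so Mars sees a real file rather than a symlink.
rule licence\.dat DEFAULT
```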

In one configuration, where I set all my output file mappings to absolute
paths in both the source Swift script and mappers.sh, I was getting falsely
successful jobs: Swift did not complain, but only blank output files were
touch'd (by cdm?). It complained only at the end, when the files were not
found by the last job, which takes them as input.

Another issue was with the workdir in my sites.xml. It was a relative path
in mine, whereas it was absolute in yours. Swift complained with exit status
127 for the relative path and worked when I provided an absolute one. I am
not sure whether this was on trunk or 0.93.1; I will check again.
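The fix was simply making the workdirectory element in sites.xml absolute
(path below is a hypothetical placeholder):

```
<!-- relative path gave exit status 127: -->
<!-- <workdirectory>swiftwork</workdirectory> -->
<workdirectory>/scratch/ketan/swiftwork</workdirectory>
```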

Regarding the earlier issue, where I mentioned that Swift was not starting
the number of parallel jobs corresponding to the jobthrottle value on the
local provider: I observe that this is indeed the case for the local
provider, but does not seem to be when using coasters *locally*. I therefore
tried both approaches on a 32-core machine and found that the coaster
provider performed better than the local provider *with* CDM (although only
the inputs were cdm'd: 7M per job). Here are the results for different
throttle values (intended to use different numbers of cpus) with coasters:

8 cores -- 13m 25sec
16 cores -- 12m 40sec
24 cores -- 10m 51sec
32 cores -- 10m 57sec

With the local provider, some inputs cdm'd:

8 cores -- 15m 8sec
16 cores -- 12m 4sec
24 cores -- 12m 37sec
32 cores -- 11m 39sec
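For reference, the throttle values I used for each core count follow from
the (jobThrottle * 100) + 1 formula Mike described; a quick sketch of the
mapping:

```python
# jobs = jobThrottle * 100 + 1, so invert it to get the throttle
# value for a desired number of concurrent jobs (one job per core).
def throttle_for(jobs):
    return (jobs - 1) / 100.0

for cores in (8, 16, 24, 32):
    print("%2d cores -> jobThrottle %.2f" % (cores, throttle_for(cores)))
```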

It looks like the coaster provider does not take the data-movement-to-job-time
ratio into account, and in this case it turns out to be faster.

I also observe that the local provider starts with far fewer jobs, slowly
ramps up, and almost always reaches the intended peak only after 25% of the
jobs have completed.

Regards,
Ketan

On Tue, Oct 23, 2012 at 7:14 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:

> I just noticed your mention here of a "too many open files" problem.
>
> Can you tell me what "ulimit -n" (max # of open files) reports for your
> system?
>
> Can you alter your app script to return the 100+ files in a tarball
> instead of individually?
>
> What may be happening here is:
>
> - if you have low -n limit (eg 1024) and
>
> - you are using provider staging, meaning the swift or coaster service jvm
> will be writing the final output files directly and
>
> - you are writing 32 jobs x 100 files concurrently then
>
> -> you will exceed your limit of open files.
>
> Just a hypothesis - you'll need to dig deeper and see if you can extend
> the ulimit for -n.
>
> - Mike
>
> ----- Original Message -----
> > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Tuesday, October 23, 2012 2:02:15 PM
> > Subject: Re: [Swift-devel] jobthrottle value does not correspond to
> number of parallel jobs on local provider
> > Mike,
> >
> >
> > Thank you for your answers.
> >
> >
> > I tried catsnsleep with n=100 and s=10 and indeed the number of
> > parallel jobs corresponded to the jobthrottle value.
> > Surprisingly, when I started the mars application immediately after
> > this, it also started 32 jobs in parallel. However, the run failed
> > with "too many open files" error after a while.
> >
> >
> > Now, I am trying cdm method. Will keep you posted.
> >
> >
> > On Tue, Oct 23, 2012 at 2:36 PM, Michael Wilde < wilde at mcs.anl.gov >
> > wrote:
> >
> >
> > Ketan, looking further I see that your app has a large number of
> > output files, O(100). Depending on their size, and the speed of the
> > filesystem on which you are testing, that reinforces my suspicion
> > that the low concurrency you are seeing is due to staging IO.
> >
> > If this is a local 32-core host, try running with your input and
> > output data and workdirectory all on a local hard disk (or even
> > /dev/shm if it has sufficient RAM/space). Then try using CDM direct as
> > explained at:
> >
> >
> http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases
> >
> >
> > - Mike
> >
> > ----- Original Message -----
> >
> >
> > > From: "Michael Wilde" < wilde at mcs.anl.gov >
> > > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > Sent: Tuesday, October 23, 2012 12:23:34 PM
> > > Subject: Re: [Swift-devel] jobthrottle value does not correspond to
> > > number of parallel jobs on local provider
> > > Hi Ketan,
> > >
> > > In the log you attached I see this:
> > >
> > > <profile key="jobThrottle" namespace="karajan">0.10</profile>
> > > <profile namespace="karajan" key="initialScore">100000</profile>
> > >
> > > You should leave initialScore constant, and set to a large number,
> > > no
> > > matter what level of manual throttling you want to specify via
> > > sites.xml. We always use 10000 for this value. Don't attempt to vary
> > > the initialScore value for manual throttle: just use jobThrottle to
> > > set what you want.
> > >
> > > A jobThrottle value of 0.10 should run 11 jobs in parallel
> > > (jobThrottle * 100) + 1 (for historical reasons related to the
> > > automatic throttling algorithm).
> > >
> > > If you are seeing less than that, one common cause is that the ratio
> > > of your input staging times to your job run times is so high as to
> > > make it impossible for Swift to keep the expected/desired number of
> > > jobs in active state at once.
> > >
> > > I suggest you test the throttle behavior with a simple app script
> > > like
> > > "catsnsleep" (catsn with an artificial sleep to increase job
> > > duration). If your settings (sites + cf) work for that test, then
> > > they
> > > should work for the real app, within the staging constraints. Using
> > > CDM "direct" mode is likely what you want here to eliminate
> > > unnecessary staging on a local cluster.
> > >
> > > In your test, what was this ratio? Can you also post your cf file
> > > and
> > > the progress log from stdout/stderr?
> > >
> > > - Mike
> > >
> > > ----- Original Message -----
> > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > > To: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > > Sent: Tuesday, October 23, 2012 10:34:25 AM
> > > > Subject: [Swift-devel] jobthrottle value does not correspond to
> > > > number of parallel jobs on local provider
> > > > Hi,
> > > >
> > > >
> > > > I am trying to run an experiment on a 32-core machine with the
> > > > hope
> > > > of
> > > > running 8, 16, 24 and 32 jobs in parallel. I am trying to control
> > > > these numbers of parallel jobs by setting the Karajan jobthrottle
> > > > values in sites.xml to 0.07, 0.15, and so on.
> > > >
> > > >
> > > > However, it seems that the values are not corresponding to what I
> > > > see
> > > > in the Swift progress text.
> > > >
> > > >
> > > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in
> > > > parallel. Then I added the line setting "Initialscore" value to
> > > > 10000,
> > > > which improved the jobs to 5. After this a 10-fold increase in
> > > > "initialscore" did not improve the jobs count.
> > > >
> > > >
> > > > Furthermore, a new batch of 5 jobs get started only when *all*
> > > > jobs
> > > > from the old batch are over as opposed to a continuous supply of
> > > > jobs
> > > > from "site selection" to "stage out" state which happens in the
> > > > case
> > > > of coaster and other providers.
> > > >
> > > >
> > > > The behavior is the same in Swift 0.93.1 and latest trunk.
> > > >
> > > >
> > > >
> > > > Thank you for any clues on how to set the expected number of
> > > > parallel
> > > > jobs to these values.
> > > >
> > > >
> > > > Please find attached one such log of this run.
> > > > Thanks, --
> > > > Ketan
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > >
> > > --
> > > Michael Wilde
> > > Computation Institute, University of Chicago
> > > Mathematics and Computer Science Division
> > > Argonne National Laboratory
> > >
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> >
> >
> >
> >
> > --
> > Ketan
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
>


-- 
Ketan