[Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider

Michael Wilde wilde at mcs.anl.gov
Tue Oct 23 18:14:02 CDT 2012


I just noticed your mention here of a "too many open files" problem.

Can you tell me what "ulimit -n" (max # of open files) reports for your system?

Can you alter your app script to return the 100+ files in a tarball instead of individually?

What may be happening here is:

- if you have a low -n limit (e.g. 1024), and

- you are using provider staging, meaning the Swift or coaster service JVM will be writing the final output files directly, and

- you are writing 32 jobs x 100 files (about 3200 files) concurrently, then

-> you will exceed your limit of open files.

Just a hypothesis - you'll need to dig deeper and see if you can extend the ulimit for -n.
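
As a rough check of this hypothesis, the sketch below compares the worst-case number of simultaneously open output files (32 jobs x 100 files, the figures from this thread) against the soft limit that "ulimit -n" reports. The function names are hypothetical, and the estimate assumes every output file of every running job is open at the same time in a single process. The real JVM also holds descriptors for jars, sockets and logs, so treat the figure as a lower bound.

```python
import resource  # Unix-only standard library module


def estimated_open_files(concurrent_jobs, files_per_job):
    """Worst-case number of output files open at once (hypothetical helper)."""
    return concurrent_jobs * files_per_job


def exceeds_limit(concurrent_jobs, files_per_job, soft_limit):
    """True if the worst-case estimate is above the given open-file limit."""
    return estimated_open_files(concurrent_jobs, files_per_job) > soft_limit


if __name__ == "__main__":
    # Soft limit for this process: the same value "ulimit -n" prints in the shell.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = estimated_open_files(32, 100)
    if soft == resource.RLIM_INFINITY:
        print("no soft limit on open files; estimate is", needed)
    else:
        print("soft limit:", soft, "estimated need:", needed,
              "exceeds:", exceeds_limit(32, 100, soft))
```

With a limit of 1024 the estimate of 3200 is well over the limit, which would be consistent with the "too many open files" failure.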

- Mike

----- Original Message -----
> From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Tuesday, October 23, 2012 2:02:15 PM
> Subject: Re: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider
> Mike,
> 
> 
> Thank you for your answers.
> 
> 
> I tried catsnsleep with n=100 and s=10 and indeed the number of
> parallel jobs corresponded to the jobthrottle value.
> Surprisingly, when I started the mars application immediately after
> this, it also started 32 jobs in parallel. However, the run failed
> with "too many open files" error after a while.
> 
> 
> Now I am trying the CDM method. I will keep you posted.
> 
> 
> On Tue, Oct 23, 2012 at 2:36 PM, Michael Wilde < wilde at mcs.anl.gov >
> wrote:
> 
> 
> Ketan, looking further I see that your app has a large number of
> output files, O(100). Depending on their size and the speed of the
> filesystem on which you are testing, that reinforces my suspicion
> that the low concurrency you are seeing is due to staging I/O.
> 
> If this is a local 32-core host, try running with your input and
> output data and work directory all on a local hard disk (or even
> /dev/shm if it has sufficient RAM/space). Then try using CDM direct as
> explained at:
> 
> http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases
> 
> 
> - Mike
> 
> ----- Original Message -----
> 
> 
> > From: "Michael Wilde" < wilde at mcs.anl.gov >
> > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > Sent: Tuesday, October 23, 2012 12:23:34 PM
> > Subject: Re: [Swift-devel] jobthrottle value does not correspond to
> > number of parallel jobs on local provider
> > Hi Ketan,
> >
> > In the log you attached I see this:
> >
> > <profile key="jobThrottle" namespace="karajan">0.10</profile>
> > <profile namespace="karajan" key="initialScore">100000</profile>
> >
> > You should leave initialScore constant, set to a large number, no
> > matter what level of manual throttling you want to specify via
> > sites.xml. We always use 10000 for this value. Don't attempt to vary
> > the initialScore value for manual throttling: just use jobThrottle to
> > set what you want.
> >
> > A jobThrottle value of 0.10 should run 11 jobs in parallel, i.e.
> > (jobThrottle * 100) + 1 (for historical reasons related to the
> > automatic throttling algorithm).
> >
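
The arithmetic above can be sketched as follows, assuming the relationship is exactly (jobThrottle * 100) + 1 as stated in this message. The helper names are hypothetical, and the rounding is an assumption added here to absorb floating-point error; it is not taken from the Swift source.

```python
def parallel_jobs(job_throttle):
    """Expected number of concurrent jobs for a given jobThrottle value.

    Assumes the (jobThrottle * 100) + 1 rule quoted in the thread; round()
    guards against results such as 0.07 * 100 == 7.000000000000001.
    """
    return round(job_throttle * 100) + 1


def throttle_for(jobs):
    """Inverse of parallel_jobs: the jobThrottle value for a target job count."""
    return (jobs - 1) / 100


if __name__ == "__main__":
    print(parallel_jobs(0.10))
    for target in (8, 16, 24, 32):
        print(target, "jobs ->", throttle_for(target))
```

Under that assumption, targets of 8, 16, 24 and 32 parallel jobs correspond to jobThrottle values of 0.07, 0.15, 0.23 and 0.31.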
> > If you are seeing fewer than that, one common cause is that the ratio
> > of your input staging times to your job run times is so high that
> > Swift cannot keep the expected/desired number of jobs in the active
> > state at once.
> >
> > I suggest you test the throttle behavior with a simple app script like
> > "catsnsleep" (catsn with an artificial sleep to increase job
> > duration). If your settings (sites + cf) work for that test, then they
> > should work for the real app, within the staging constraints. Using
> > CDM "direct" mode is likely what you want here to eliminate
> > unnecessary staging on a local cluster.
> >
> > In your test, what was this ratio? Can you also post your cf file and
> > the progress log from stdout/stderr?
> >
> > - Mike
> >
> > ----- Original Message -----
> > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > To: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > Sent: Tuesday, October 23, 2012 10:34:25 AM
> > > Subject: [Swift-devel] jobthrottle value does not correspond to
> > > number of parallel jobs on local provider
> > > Hi,
> > >
> > >
> > > I am trying to run an experiment on a 32-core machine with the hope of
> > > running 8, 16, 24 and 32 jobs in parallel. I am trying to control
> > > these numbers of parallel jobs by setting the Karajan jobthrottle
> > > values in sites.xml to 0.07, 0.15, and so on.
> > >
> > >
> > > However, it seems that the values do not correspond to what I see
> > > in the Swift progress text.
> > >
> > >
> > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in
> > > parallel. Then I added the line setting the "initialScore" value to
> > > 10000, which raised the number of parallel jobs to 5. After this, a
> > > 10-fold increase in "initialScore" did not raise the job count further.
> > >
> > >
> > > Furthermore, a new batch of 5 jobs gets started only when *all* jobs
> > > from the old batch are over, as opposed to a continuous supply of jobs
> > > from the "site selection" to the "stage out" state, which happens in the
> > > case of coaster and other providers.
> > >
> > >
> > > The behavior is the same in Swift 0.93.1 and the latest trunk.
> > >
> > >
> > >
> > > Thank you for any clues on how to set the expected number of parallel
> > > jobs to these values.
> > >
> > >
> > > Please find attached one such log of this run.
> > > Thanks, --
> > > Ketan
> > >
> > >
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> 
> 
> 
> 
> --
> Ketan

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
