[Swift-devel] Condor with coasters question
Michael Wilde
wilde at mcs.anl.gov
Sat Apr 28 17:22:58 CDT 2012
Try reducing maxtime or increasing maxwalltime. The Swift script is running 500 cat jobs, right?
Each worker has 60 mins wall time. Each app is sized at 5 mins, so 12 app calls can fit per coaster slot.
By the time coasters had started 100 slots, it had allocated all the slots it needed to run the work your script had queued up.
If you make your jobs run longer (e.g., use catsnsleep to sleep for, say, 30 secs), give a maxwalltime of 10 mins (600 secs), and a maxtime of, say, 900 secs, then the coaster scheduler will decide it needs a separate coaster for each job, which is what you want to see.
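Concretely, in sites.xml that would be something like (a sketch; keep your other entries as they are):

<profile key="maxWalltime" namespace="globus">00:10:00</profile>
<profile key="maxTime" namespace="globus">900</profile>
<!-- 900 < 2 x 600, so only one 10-min job fits per slot, forcing
     coasters to start a separate slot for each pending job -->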
Similarly when you shift to the real DSSAT code, where the real runtime is 150 secs. Do the division to see what times you need to specify to get the max number of coasters started. If you give too large a maxtime, coasters will think it's best to fill those slots out to their max time rather than launch more coasters.
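For example, assuming you size each DSSAT app at 3 mins of walltime (an assumed figure; adjust to your actual sizing):

  maxWalltime 180, maxTime 3600: floor(3600/180) = 20 apps per slot,
    so 1000 tasks would need only ~50 slots
  maxWalltime 180, maxTime 300:  floor(300/180) = 1 app per slot,
    so coasters must start one slot per task, up to the slots limit

(Rough numbers: coasters reserves a bit of each block for overhead, so the effective capacity is slightly less.)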
- Mike
----- Original Message -----
> From: "David Kelly" <davidk at ci.uchicago.edu>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Saturday, April 28, 2012 4:54:46 PM
> Subject: Re: [Swift-devel] Condor with coasters question
> I adjusted the parameters a bit and tried again with this
> configuration:
>
> <config>
>   <pool handle="uc3">
>     <execution jobmanager="local:condor" provider="coaster" url="none"/>
>     <filesystem provider="local" url="none" />
>     <workdirectory>_WORK_</workdirectory>
>     <profile namespace="globus" key="maxNodes">1000</profile>
>     <profile key="slots" namespace="globus">1000</profile>
>     <profile key="maxTime" namespace="globus">3600</profile>
>     <profile key="maxWalltime" namespace="globus">00:05:00</profile>
>     <profile key="highOverallocation" namespace="globus">100</profile>
>     <profile key="lowOverallocation" namespace="globus">100</profile>
>     <profile key="nodeGranularity" namespace="globus">1</profile>
>     <profile key="jobsPerNode" namespace="globus">1</profile>
>     <profile namespace="karajan" key="jobThrottle">1000</profile>
>     <profile namespace="karajan" key="initialScore">10000</profile>
>   </pool>
> </config>
>
> The number of active jobs maxed out at 101 with this.
>
> Thanks,
> David
>
> ----- Original Message -----
> > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > To: "David Kelly" <davidk at ci.uchicago.edu>
> > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Saturday, April 28, 2012 3:13:15 PM
> > Subject: Re: [Swift-devel] Condor with coasters question
> > I meant to cc this to swift-devel so am resending it.
> >
> > I think multi-node jobs on Condor should work in principle but in
> > practice may need to be tested and debugged.
> >
> > I think we should first see if we can fill the UC3 cluster with
> > maxnode=1 slots=500.
> >
> > One possible reason that only 70 jobs were issued is that your prior
> > test, David, looks like it was using default values for the times
> > involved, and possibly Swift "packed" the pending requests into the
> > 70 job slots you saw. Hence my suggestion to try the config below.
> >
> > - Mike
> >
> > On Sat, Apr 28, 2012 at 1:28 PM, Michael Wilde <wilde at mcs.anl.gov>
> > wrote:
> > > David, can you try a test that specifies:
> > >
> > > Maxtime 3600
> > > Maxwalltime 00:00:10 (or as needed for your app)
> > > High and lowoverallocation 100
> > >
> > > I would think each coaster (x 480) should get a separate submit
> > > file with count 1, just as would be done for PBS.
> > >
> > > - Mike
> > >
> > > On 4/28/12, David Kelly <davidk at ci.uchicago.edu> wrote:
> > >> Hello,
> > >>
> > >> I am trying to get Swift working well on a machine that uses
> > >> Condor. It has 480 available slots. I am using a Swift script
> > >> that will run 1000 tasks.
> > >>
> > >> sites.xml:
> > >> <config>
> > >>   <pool handle="uc3">
> > >>     <execution jobmanager="local:condor" provider="coaster" url="none"/>
> > >>     <filesystem provider="local" url="none" />
> > >>     <workdirectory>/home/davidk/test/benchmark-release/run012</workdirectory>
> > >>     <profile namespace="globus" key="maxNodes">480</profile>
> > >>     <profile key="slots" namespace="globus">480</profile>
> > >>     <profile key="nodeGranularity" namespace="globus">1</profile>
> > >>     <profile key="jobsPerNode" namespace="globus">1</profile>
> > >>     <profile namespace="karajan" key="jobThrottle">1000</profile>
> > >>     <profile namespace="karajan" key="initialScore">10000</profile>
> > >>   </pool>
> > >> </config>
> > >>
> > >> cf:
> > >> wrapperlog.always.transfer=true
> > >> sitedir.keep=false
> > >> execution.retries=0
> > >> lazy.errors=false
> > >> status.mode=provider
> > >> use.provider.staging=false
> > >> provider.staging.pin.swiftfiles=true
> > >> foreach.max.threads=1000
> > >>
> > >> What I am seeing is that only ~70 tasks are active at once. When I
> > >> look at condor_q, I see there are ~70 jobs that I have submitted,
> > >> no more, none idle. Any ideas where this limit is coming from?
> > >>
> > >> I thought I could get around this by setting nodeGranularity to
> > >> 50. But when I do that, what seems to happen is that 50 machines
> > >> are allocated per worker.pl, which would make sense for an MPI job
> > >> but is not what I want here. (The condor submit script sets
> > >> machine_count to 50, but only queues 1.)
> > >>
> > >> I can get around this for now by using the plain condor provider,
> > >> but ideally I would like to use coasters.
> > >>
> > >> Thanks,
> > >> David
> > >>
> > >
> > > --
> > > Sent from my mobile device
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory