[Swift-user] Swift is stuck with 5K jobs

Michael Wilde wilde at mcs.anl.gov
Mon Mar 14 15:49:24 CDT 2011


Andriy,

Another alternative is to run Swift outside of the cluster, e.g. on a script execution host at your home institution, and manually start the coaster workers in a PBS job. These workers would connect back to the swift command (or to an external coaster server process) to pick up jobs to run.

This takes some scripting and documentation that is not provided in the Swift release yet, but it's a strategy that we could help you with if and when needed.
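
As a rough illustration of the kind of scripting involved (a sketch only, not something shipped with Swift), the Python snippet below generates a PBS batch script that would start a worker inside a PBS job. The worker launch command, the host name, and the port are hypothetical placeholders; the real worker invocation depends on your Swift installation.

```python
def build_worker_job(nodes, walltime, worker_cmd, job_name="swift-workers"):
    """Return the text of a PBS batch script that runs worker_cmd.

    worker_cmd is a placeholder for whatever command starts a coaster
    worker at your site; it is not a documented Swift interface.
    """
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        f"#PBS -l nodes={nodes}",
        f"#PBS -l walltime={walltime}",
        "",
        "# Start in the directory the job was submitted from.",
        'cd "$PBS_O_WORKDIR"',
        "",
        "# Hypothetical worker launch; runs on the first allocated node only.",
        worker_cmd,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Host, port, and script name below are made-up examples.
    print(build_worker_job(2, "01:00:00",
                           "./start_worker.sh http://submit-host.example.org:50000"))
```

The generated text could be saved to a file and submitted with qsub. Launching a worker on every allocated node is site-specific and is left out of this sketch.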

I think another feature of Swift, Collective Data Management (CDM), can be used in such cases to specify that your data files exist on the cluster side rather than on the host that's running the swift command.

- Mike

----- Original Message -----
> Michael,
> 
> This is a very good observation.
> 
> The problem is one has to know approximately how long the total run of
> the swift script will take, which includes the time to wait in the
> queue for the computing resources. I do not know how such estimations
> can be reliably obtained.
> 
> IMHO, submission from the head node is ok, since it occupies only one
> CPU. However, I believe processes that are running on the head node
> for more than 30 minutes are terminated automatically, so submission
> from the head node may not work for all cases.
> 
> Any other ideas?
> 
> --
> Andriy Fedorov, Ph.D.
> 
> Research Fellow
> Brigham and Women's Hospital
> Harvard Medical School
> 75 Francis Street
> Boston, MA 02115 USA
> fedorov at bwh.harvard.edu
> (617) 525-6258 (office)
> 
> 
> 
> On Mon, Mar 14, 2011 at 13:45, Michael Wilde <wilde at mcs.anl.gov>
> wrote:
> > Andriy, All,
> >
> > On systems like TeraGrid hosts where the login hosts are frequently
> > heavily loaded, we should verify that you can obtain a single
> > interactive compute node via qsub -I on which to run the swift
> > command (ideally under screen to make re-attachment easy) and that
> > from there Swift can run jobs using the Coaster-over-PBS provider
> > configuration.
> >
> > I suspect (and hope) that any cluster node on say abe, queenbee, and
> > ranger can also run qsub and qstat. We should test and document
> > that, but in the meantime, Andriy, can you try that approach? I
> > *think* that it should be identical to running from a login host.
> >
> > What I want to avoid is causing too heavy a load on any login host
> > and in the process getting Swift banned or having it associated with
> > causing system problems.
> >
> > Thanks and regards,
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> >> On Mon, 2011-03-14 at 11:06 -0400, Andriy Fedorov wrote:
> >> > Am I hitting some limit? Is 5K jobs too much?
> >>
> >> Shouldn't be, but if you have the coaster service running in local
> >> mode, that might do the trick.
> >>
> >> >
> >> > How do I terminate swift now not to waste cycles of the head
> >> > node?
> >>
> >> kill -9 <pidOfJavaProcess>
> >>
> >>
> >> _______________________________________________
> >> Swift-user mailing list
> >> Swift-user at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



