[Swift-user] Swift is stuck with 5K jobs

Andriy Fedorov fedorov at bwh.harvard.edu
Mon Mar 14 16:20:52 CDT 2011


On Mon, Mar 14, 2011 at 16:49, Michael Wilde <wilde at mcs.anl.gov> wrote:
> but its a strategy that we could help you with if and when needed.
>

Good to know there is a strategy!

From a practical point of view, my approach will be to run from the head
node until someone slaps my hands; if the job is terminated prematurely
because it exceeds the wallclock limit on the head node, I will use
Swift's restart capability to continue.
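In case it is useful to anyone, roughly what I have in mind is the following (the restart log name below is made up; each run writes its own `<runid>.0.rlog`, and this obviously depends on a working Swift installation):

```shell
# Initial run from the head node; may get killed at the 30-minute limit.
swift -sites.file sites.xml -tc.file tc.data myscript.swift

# Each run leaves a restart log (<runid>.0.rlog). Resuming with it
# should skip the app calls that already completed successfully.
swift -resume myscript-20110314-1620-abc123.0.rlog \
      -sites.file sites.xml -tc.file tc.data myscript.swift
```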

If someone puts together documentation on how to run it from a remote
location, I might try it.

So I understand this has not been a problem so far? Does this mean no one
has ever run a swift script that took more than 30 minutes to complete,
including queue time? I am just curious whether there is another
workaround I am not aware of.
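For what it's worth, my understanding of the interactive-node workaround suggested earlier in this thread is something like the following (the walltime and node spec are guesses; the exact PBS syntax will vary per site):

```shell
# Grab a single interactive compute node (PBS); adjust resources per site.
qsub -I -l nodes=1:ppn=1,walltime=08:00:00

# On the compute node, run swift inside screen so the session survives
# a dropped SSH connection; re-attach later with `screen -r swiftrun`.
screen -S swiftrun
swift -sites.file sites.xml -tc.file tc.data myscript.swift
```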

> I think another feature of Swift, Collective Data Management (CDM), can be used in such cases to specify that your data files exist on the cluster side rather than
> on the host that's running the swift command.
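Interesting. If I am reading the CDM description correctly, that would be a small policy file mapping file patterns to data that already lives on the cluster filesystem, roughly like this (the rule syntax, pattern, and path here are my guesses for illustration, not tested):

```shell
# Hypothetical CDM policy file: files matching the pattern are accessed
# DIRECTly from the cluster filesystem instead of being staged in.
cat > fs.data <<'EOF'
rule .*\.dat DIRECT /gpfs/mydata
EOF

# Then point swift at the policy file:
swift -cdm.file fs.data -sites.file sites.xml -tc.file tc.data myscript.swift
```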
>
> - Mike
>
> ----- Original Message -----
>> Michael,
>>
>> This is a very good observation.
>>
>> The problem is one has to know approximately how long the total run of
>> the swift script will take, which includes the time to wait in the
>> queue for the computing resources. I do not know how such estimations
>> can be reliably obtained.
>>
>> IMHO, submission from the head node is ok, since it occupies only one
>> CPU. However, I believe processes that are running on the head node
>> for more than 30 minutes are terminated automatically, so submission
>> from the head node may not work for all cases.
>>
>> Any other ideas?
>>
>> --
>> Andriy Fedorov, Ph.D.
>>
>> Research Fellow
>> Brigham and Women's Hospital
>> Harvard Medical School
>> 75 Francis Street
>> Boston, MA 02115 USA
>> fedorov at bwh.harvard.edu
>> (617) 525-6258 (office)
>>
>>
>>
>> On Mon, Mar 14, 2011 at 13:45, Michael Wilde <wilde at mcs.anl.gov>
>> wrote:
>> > Andriy, All,
>> >
>> > On systems like TeraGrid hosts where the login hosts are frequently
>> > heavily loaded, we should verify that you can obtain a single
>> > interactive compute node via qsub -I on which to run the swift
>> > command (ideally under screen to make re-attachment easy) and that
>> > from there Swift can run jobs using the Coaster-over-PBS provider
>> > configuration.
>> >
>> > I suspect (and hope) that any cluster node on say abe, queenbee, and
>> > ranger can also run qsub and qstat. We should test and document
>> > that, but in the meantime, Andriy, can you try that approach? I
>> > *think* that it should be identical to running from a login host.
>> >
>> > What I want to avoid is causing too heavy a load on any login host
>> > and in the process getting Swift banned or having it associated with
>> > causing system problems.
>> >
>> > Thanks and regards,
>> >
>> > - Mike
>> >
>> >
>> > ----- Original Message -----
>> >> On Mon, 2011-03-14 at 11:06 -0400, Andriy Fedorov wrote:
>> >> > Am I hitting some limit? Is 5K jobs too much?
>> >>
>> >> Shouldn't be, but if you have the coaster service running in local
>> >> mode,
>> >> that might do the trick.
>> >>
>> >> >
>> >> > How do I terminate swift now not to waste cycles of the head
>> >> > node?
>> >>
>> >> kill -9 <pidOfJavaProcess>
>> >>
>> >>
>> >> _______________________________________________
>> >> Swift-user mailing list
>> >> Swift-user at ci.uchicago.edu
>> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>> >
>> > --
>> > Michael Wilde
>> > Computation Institute, University of Chicago
>> > Mathematics and Computer Science Division
>> > Argonne National Laboratory
>> >
>> >
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
>


