[Swift-devel] second wave of jobs do not start

Ketan Maheshwari ketan at mcs.anl.gov
Wed Mar 11 14:34:42 CDT 2015


The problem was related to job and wall times. maxWalltime was set to 18
minutes and maxJobtime to 20 minutes. After the first 8 jobs completed,
coaster concluded that there was no time left to accommodate any more jobs.

The run completes after setting maxWalltime=4mins and maxJobtime=60mins.
The run was configured to start 2 parallel tasks at a time, with 64 tasks
in total, each taking 2-3 sec.
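
To illustrate why the first settings stall, here is a rough sketch in Python. It is not the actual coaster scheduling code, only a simplified model that assumes maxWalltime is the time requested per task and maxJobtime is the lifetime of the scheduler job, and that a task is started only if its full requested walltime still fits in what is left of the job. The function names are made up for this example.

```python
def can_start_task(elapsed_min, max_jobtime_min, max_walltime_min):
    """Return True if a task requesting max_walltime_min still fits
    in the remaining lifetime of the scheduler job (simplified model)."""
    remaining = max_jobtime_min - elapsed_min
    return remaining >= max_walltime_min


def start_window(max_jobtime_min, max_walltime_min):
    """Minutes after job start during which new tasks can still be placed."""
    return max(0, max_jobtime_min - max_walltime_min)


if __name__ == "__main__":
    # Original settings: maxWalltime=18, maxJobtime=20
    print(start_window(20, 18))       # 2-minute window for starting tasks
    print(can_start_task(3, 20, 18))  # False: 17 minutes left, 18 requested

    # Working settings: maxWalltime=4, maxJobtime=60
    print(start_window(60, 4))        # 56-minute window
    print(can_start_task(3, 60, 4))   # True
```

Under this model, the tasks themselves take only 2-3 sec, but it is the requested walltime, not the actual runtime, that decides whether another task can be placed in the job.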

Some notes are below. A few things in the worker logs, which I am still
studying, were not clear.

-- Actually, it was not 1 wave: 4 waves of 2 tasks each were executed (8
tasks in total).

-- The worker starts twice: the first instance shuts down after running 2
waves and idling for 3-4 minutes.

-- The second instance of the worker starts and runs 2 waves of jobs, but
then keeps idling for more than half an hour, after which I kill the run.
During this time, the scheduler job remains in the running state and shuts
down when its walltime expires (20 minutes).

-- Each worker shows 8 processes forked and terminated, totalling 16
processes in all, but we see only 8 tasks. My guess is that for each
process, the worker forks a watchdog/monitor process (I could be wrong here).

--
Ketan

On Wed, Mar 11, 2015 at 2:21 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> And I'd like to know what the issue was!
>
> Mihael
>
> On Wed, 2015-03-11 at 14:16 -0500, Ketan Maheshwari wrote:
> > Hi,
> >
> > Please ignore, this was resolved after discussion and debugging with
> > Mike.
> >
> > --Ketan
> >
> > On Wed, Mar 11, 2015 at 10:33 AM, Ketan Maheshwari <ketan at mcs.anl.gov>
> > wrote:
> >
> > > Hi
> > >
> > > With trunk, coasters on ALCF, I am seeing that after a first wave of
> > > jobs finish, the second wave does not start.
> > >
> > > After the completion of first wave of jobs, the Swift progress text
> > > shows jobs in submitted state while the queue (qstat) still shows
> > > running status. After a while the queue walltime expires and there are
> > > no more new jobs submitted to the queue.
> > >
> > > Two worker log files are created for the run, possibly the worker shuts
> > > down and restarts for a second wave.
> > >
> > > Attached are the run log and worker logs.
> > >
> > > Thanks for any help debugging/fixing.
> > > --
> > > Ketan
> > >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>