[Swift-devel] Update on Teraport problems with wavelet workflow

Tiberiu Stef-Praun tiberius at ci.uchicago.edu
Wed Feb 28 12:27:07 CST 2007


In this case, everything that was submitted failed. As soon as I
noticed the failures, I stopped the workflow to investigate the cause.
I will re-run and let you know about kickstart records.
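
Once there are records to look at, the node check suggested below (find
the node name, grep the other records, see whether they all fail or only
some) could be scripted roughly as follows. This is only a sketch under
assumptions not confirmed in this thread: that each kickstart record is an
XML file matching *kickstart*.xml, and that it carries hostname="..." and
exitcode="..." attributes. The function names are hypothetical; adjust the
patterns to whatever the real records contain.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed record layout (not confirmed in this thread): each record is an
# XML file containing hostname="..." and exitcode="..." attributes.
HOST_RE = re.compile(r'hostname="([^"]+)"')
EXIT_RE = re.compile(r'exitcode="(-?\d+)"')


def tally_by_node(record_dir, pattern="*kickstart*.xml"):
    """Return {node: {"ok": n, "failed": m}} over all records found."""
    tally = defaultdict(lambda: {"ok": 0, "failed": 0})
    for path in sorted(Path(record_dir).rglob(pattern)):
        text = path.read_text(errors="replace")
        host = HOST_RE.search(text)
        if host is None:
            continue  # no node name in this record; skip it
        codes = [int(c) for c in EXIT_RE.findall(text)]
        # A record with no exit code, or any non-zero one, counts as failed.
        ok = bool(codes) and all(c == 0 for c in codes)
        tally[host.group(1)]["ok" if ok else "failed"] += 1
    return dict(tally)


def mixed_nodes(tally):
    """Nodes where some jobs worked and some failed."""
    return sorted(n for n, t in tally.items() if t["ok"] and t["failed"])
```

A node whose records all land under "failed" would support the bad-node
speculation; a node that shows up in mixed_nodes() would point elsewhere.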

On 2/28/07, Ben Clifford <benc at hawaga.org.uk> wrote:
>
> do you have kickstart records for the nodes that *do* run?
>
> On Wed, 28 Feb 2007, Tiberiu Stef-Praun wrote:
>
> > Nothing gets generated in the individual job's temporary directories.
> > There is no kickstart record.
> > It would be really useful to find out the hostname of the node on
> > which these jobs ran.
> >
> > Let me retry some more workflow runs.
> >
> > On 2/28/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> > >
> > >
> > > On Wed, 28 Feb 2007, Ben Clifford wrote:
> > >
> > > > do you have kickstart records for the jobs that are failing?
> > >
> > > if you do, then:
> > >
> > > > > Summary/Speculation: bad teraport node causes job to be declared as
> > > > > done even though the execution failed
> > >
> > > this speculation can be investigated further by:
> > >
> > > finding a job that breaks. finding the node name from the kickstart
> > > record. grepping all the kickstart records to find other kickstart records
> > > from that node. looking to see if they all fail, or if some work and some
> > > fail. then reporting the findings back here.
> > >
> > > --
> > >
> >
> >
> >
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


