[Swift-devel] Update on Teraport problems with wavelet workflow

Tiberiu Stef-Praun tiberius at ci.uchicago.edu
Wed Feb 28 12:36:45 CST 2007


I stopped it

On 2/28/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> On Wed, 2007-02-28 at 12:14 -0600, Tiberiu Stef-Praun wrote:
> > Here is more info:
> > Indeed, yesterday I got 175 successful jobs out of a total of 192, and
> > the workflow never finished
>
> Did the workflow lock up or did you interrupt it because you got tired
> of it trying to transfer all the missing files?
>
> > (it kept retrying the transfers of files from
> > the failed jobs, which failed because the files did not exist).
> > Looking at the processor load and at the transfer load, the
> > 175 jobs were done in about 75 minutes (roughly a 10x speedup over
> > serialized execution).
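> >
> > As a rough sanity check of that speedup figure (a back-of-the-envelope
> > sketch only, using the single-job time listed below as the serial
> > baseline):
> >
> >   # speedup_check.py - hypothetical sketch; numbers taken from this message
> >   single_job_minutes = 4        # one job end-to-end, including overhead
> >   jobs_completed = 175
> >   parallel_minutes = 75.0
> >   serial_estimate = single_job_minutes * jobs_completed  # ~700 minutes
> >   print(serial_estimate / parallel_minutes)              # ~9.3, i.e. roughly 10x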
> >
> > At Mihael's suggestion I started with smaller workflows, so here are
> > the numbers (for the ones that completed successfully):
> > 1 job: 4 minutes
> > 6 jobs: 6 minutes
> > 24 jobs: 20 minutes
> > 36 jobs: 25 minutes (10 minutes execution + 15 minutes data transfer).
> >
> > I have a total of 192 jobs to run.
> >
> >
> > I have retried running some of the failed workflows, and they fail
> > because some task in the workflow does not run correctly. For instance,
> > the most troubling case was the latest run: the submitted jobs failed
> > right at the beginning, even though they had run successfully in the
> > previous run.
> > My current assumption is that one (or possibly several) cluster nodes are bad.
> > The failure can be observed in the log in the following way: a job gets
> > submitted, and 20 seconds later GRAM declares it finished (normal
> > execution time is about 3 minutes), so the workflow attempts to
> > transfer back nonexistent files (nothing gets generated: no outputs,
> > no stdout, stderr, or kickstart record in the job's working directory),
> > and this creates zero-size files on the submission machine. That is
> > bad because, when attempting a -resume, those failed jobs are not
> > reconsidered for execution.
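> >
> > If anyone hits the same thing, here is a hypothetical cleanup sketch
> > (not part of Swift; the run-directory argument is an assumption) that
> > deletes those zero-size files before retrying with -resume, so the
> > failed jobs get reconsidered:
> >
> >   #!/usr/bin/env python
> >   # prune_empty.py - remove zero-byte files left behind by the failed
> >   # transfers, on the assumption that -resume skips a job whenever its
> >   # output file already exists.
> >   import os
> >   import sys
> >
> >   run_dir = sys.argv[1] if len(sys.argv) > 1 else "."
> >   for dirpath, _subdirs, filenames in os.walk(run_dir):
> >       for name in filenames:
> >           path = os.path.join(dirpath, name)
> >           if os.path.getsize(path) == 0:
> >               print("removing empty file: " + path)
> >               os.remove(path)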
> >
> > Summary/speculation: a bad Teraport node causes a job to be declared
> > done even though the execution failed.
> >
> > I will move to another Grid site, run there locally, and hopefully
> > not see the same behavior as on Teraport.
> >
> > Tibi
> >
> > On 2/28/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
> > > Mihael informs me that the latest problems with the wavelet workflow indicate
> > > that some number of jobs in the workflow are failing to launch under PBS
> > > through the pre-WS GRAM provider. These failing jobs seem to give no
> > > indication whatsoever of where the underlying failure is occurring.
> > >
> > > I think Tibi indicated yesterday that about 25 out of 200 parallel jobs
> > > are failing in this manner (not sure I have these numbers right).
> > >
> > > Mihael is continuing to experiment to characterize the failure better and will
> > > report back to the group (and involve the TP and GRAM support teams) when he
> > > knows more.
> > >
> > > - Mike
> > >
> > > --
> > > Mike Wilde
> > > Computation Institute, University of Chicago
> > > Math & Computer Science Division
> > > Argonne National Laboratory
> > > Argonne, IL   60439    USA
> > > tel 630-252-7497 fax 630-252-1997
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > >
> >
> >
>
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


