[Swift-devel] Update on Teraport problems with wavlet workflow

Mihael Hategan hategan at mcs.anl.gov
Wed Feb 28 12:33:03 CST 2007


On Wed, 2007-02-28 at 18:16 +0000, Ben Clifford wrote:
> do you have kickstart records for the jobs that are failing?

The thing is, the wrapper for those jobs doesn't seem to put any output
in the wrapper log, which suggests the wrapper was never started. That
in turn may mean that kickstart was never started either.
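
One quick way to check that across the run directory could be something
like the following (a rough sketch only; the per-job directory layout and
the "wrapper.log" file name are assumptions, not the actual Swift
conventions):

    import os
    import sys

    # Sketch: walk per-job working directories under a root given on the
    # command line and flag any whose wrapper log is missing or empty,
    # which would be consistent with the wrapper (and hence kickstart)
    # never having started.  Paths and file names here are hypothetical.
    jobs_root = sys.argv[1] if len(sys.argv) > 1 else "."

    for jobdir in sorted(os.listdir(jobs_root)):
        path = os.path.join(jobs_root, jobdir)
        if not os.path.isdir(path):
            continue
        log = os.path.join(path, "wrapper.log")  # hypothetical log name
        if not os.path.exists(log) or os.path.getsize(log) == 0:
            print("no wrapper output:", jobdir)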

> 
> On Wed, 28 Feb 2007, Tiberiu Stef-Praun wrote:
> 
> > Here is more info:
> > Indeed, yesterday I got 175 successful jobs out of the total of 192, and
> > the workflow never ended (it kept retrying the transfer of files from
> > the failed ones, which failed because the files did not exist).
> > Looking at the processor load and the transfer load, the 175 completed
> > jobs were done in about 75 minutes (about a 10x speedup over a
> > serialized execution).
> > 
> > At Mihael's suggestion I started with smaller workflows, so here are
> > the numbers (for the ones that completed successfully):
> > 1 job: 4 minutes
> > 6 jobs: 6 minutes
> > 24 jobs: 20 minutes
> > 36 jobs: 25 minutes (10 minutes execution + 15 minutes data transfer).
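
As a rough sanity check of the ~10x figure, assuming the ~4 minutes per
job seen in the 1-job run is typical (back-of-the-envelope arithmetic
only, not a measurement):

    # The per-job time is taken from the 1-job run above and may not be
    # representative of every job in the workflow.
    per_job_minutes = 4
    completed_jobs = 175
    parallel_minutes = 75

    serial_estimate = completed_jobs * per_job_minutes   # ~700 minutes
    speedup = serial_estimate / parallel_minutes          # ~9.3, close to the quoted ~10x
    print(round(speedup, 1))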
> > 
> > I have a total of 192 jobs to run.
> > 
> > 
> > I have retried running some of the failed workflows, and they fail
> > because some task in the workflow is not run correctly. For instance,
> > the most troubling case was the latest run: the submitted jobs failed
> > right at the beginning, even though they had run successfully in the
> > previous run.
> > My current assumption is that one (or several) cluster nodes are bad.
> > The failure can be observed in the log in the following way: a job gets
> > submitted, and 20 seconds later GRAM declares it finished (normal
> > execution time is about 3 minutes), so the workflow attempts to
> > transfer back nonexistent files (nothing gets generated: no outputs,
> > no stdout/stderr, and no kickstart record in the job's working
> > directory), and it creates zero-size files on the submission machine.
> > That is not good, because when attempting a -resume, those failed jobs
> > are not reconsidered for execution.
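
One way to spot those zero-size files on the submission machine before
retrying with -resume might be something like this (a sketch only; the
output directory name is an assumption, not the workflow's actual
staging location):

    import os

    # Sketch: list zero-length files under a placeholder output directory
    # so they can be inspected or removed before attempting -resume.
    out_root = "output"  # hypothetical; point at the workflow's output dir

    for dirpath, _, filenames in os.walk(out_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.getsize(full) == 0:
                print("zero-size file:", full)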
> > 
> > Summary/speculation: a bad Teraport node causes the job to be declared
> > done even though its execution failed.
> > 
> > I will move to another Grid site, run there locally, and hopefully not
> > see the same behavior as on Teraport.
> > 
> > Tibi
> > 
> > On 2/28/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
> > > Mihael informs me that the latest problems with the wavlet workflow
> > > indicate that some number of jobs in the workflow are failing to launch
> > > under PBS through the pre-WS GRAM provider.  These failing jobs seem to
> > > give no indication whatsoever of where the underlying failure is
> > > occurring.
> > > 
> > > I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs
> > > are failing in this manner (not sure I have these numbers right).
> > > 
> > > Mihael is continuing to experiment to characterize the failure better
> > > and will report back to the group (and involve the TP and GRAM support
> > > teams) when he knows more.
> > > 
> > > - Mike
> > > 
> > > --
> > > Mike Wilde
> > > Computation Institute, University of Chicago
> > > Math & Computer Science Division
> > > Argonne National Laboratory
> > > Argonne, IL   60439    USA
> > > tel 630-252-7497 fax 630-252-1997
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > 
> > 
> > 
> > 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
