[Swift-devel] Update on Teraport problems with wavlet workflow

Wed Feb 28 12:16:12 CST 2007

Do you have kickstart records for the jobs that are failing?

On Wed, 28 Feb 2007, Tiberiu Stef-Praun wrote:

> Here is more info:
> Indeed, yesterday I got 175 successful jobs out of the total of 192, and
> the workflow never ended (it kept retrying the transfer of files from
> the failed jobs, which failed because the files did not exist).
> Looking at the processor load and the transfer load, the 175
> jobs were done in about 75 minutes (about a 10x speedup over
> serialized execution).
> 
> At Mihael's suggestion I started with smaller workflows, so here are
> the numbers (for the ones that completed successfully):
> 1 job: 4 minutes
> 6 jobs: 6 minutes
> 24 jobs: 20 minutes
> 36 jobs: 25 minutes (10 minutes execution + 15 minutes data transfer)
> 
> I have a total of 192 jobs to run.
> 
> 
> I have retried running some of the failed workflows, and they fail
> because some task in the workflow is not run correctly. For instance,
> the most troubling one was the latest run: the jobs submitted failed
> right at the beginning, even though they had run successfully in the
> previous run.
> My current assumption is that one (or several) cluster nodes are bad.
> The failure can be observed in the log in the following way: a job gets
> submitted, and 20 seconds later GRAM declares it finished (normal
> execution time is about 3 minutes), so the workflow attempts to
> transfer back files that do not exist (nothing gets generated in the
> job's working directory: no outputs, no stdout, stderr, or kickstart
> record), and it creates files of size zero on the submission machine.
> That is not good, because when attempting a -resume, those failed jobs
> are not reconsidered for execution.
> 
> Summary/Speculation: bad teraport node causes job to be declared as
> done even though the execution failed
> 
> I will move to another Grid site and run there locally, and
> hopefully not get the same behavior as on teraport.
> 
> Tibi
> 
> On 2/28/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
> > Mihael informs me that the latest problems with the wavlet workflow indicate
> > that some number of jobs in the workflow are failing to launch under PBS
> > through the pre-WS GRAM provider.  These failing jobs seem to give no
> > indication whatsoever where the underlying failure is occurring.
> > 
> > I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs
> > are failing in this manner (not sure I have these numbers right).
> > 
> > Mihael is continuing to experiment to characterize the failure better and
> > will report back to the group (and involve the TP and GRAM support teams)
> > when he knows more.
> > 
> > - Mike
> > 
> > --
> > Mike Wilde
> > Computation Institute, University of Chicago
> > Math & Computer Science Division
> > Argonne National Laboratory
> > Argonne, IL   60439    USA
> > tel 630-252-7497 fax 630-252-1997
> > 
> 
> 
>
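
One way to spot the zero-size files described above before attempting a -resume is to scan the run directory on the submission machine. This is only a minimal sketch: the function name find_zero_size_files and the idea of a single run directory are assumptions for illustration, not part of Swift, and the thread does not establish whether removing such files is enough for -resume to reconsider the affected jobs.

```python
import os
import sys


def find_zero_size_files(root):
    """Return the sorted paths of regular files under `root` that are 0 bytes long.

    `root` is a hypothetical run directory on the submission machine.
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip anything that is not a regular file (e.g. broken symlinks).
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


if __name__ == "__main__":
    # Usage: python find_empty.py <run-directory>
    run_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_zero_size_files(run_dir):
        print(path)
```

The script only lists candidates; deciding what to do with them is left to whoever is running the workflow.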