[Swift-devel] Update on Teraport problems with wavlet workflow

Tiberiu Stef-Praun tiberius at ci.uchicago.edu
Wed Feb 28 12:39:15 CST 2007


I would say that the solution is: on a missing file, fully resubmit
the job, not just retry reading it back.
Anyway, I'm trying to discover what causes the file not to be generated.
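Roughly what I have in mind, as a pseudocode sketch (this is not the actual
Swift/Karajan retry code; submit_and_wait() and stage_out() are just
placeholders): if an expected output is missing or zero-size after the job
reports success, treat the job itself as failed and resubmit it, instead of
retrying the stage-out of a file that was never produced.

    import os

    MAX_RESUBMITS = 3

    def submit_and_wait(job):
        """Placeholder: submit the job through GRAM/PBS and wait for it."""
        pass

    def stage_out(path):
        """Placeholder: transfer one output file back to the submit host."""
        pass

    def run_with_resubmit(job):
        for attempt in range(1, MAX_RESUBMITS + 1):
            submit_and_wait(job)
            # A missing or zero-size output means the job did not really run.
            missing = [f for f in job["outputs"]
                       if not os.path.exists(f) or os.path.getsize(f) == 0]
            if not missing:
                for f in job["outputs"]:
                    stage_out(f)
                return True
            # Resubmit the whole job instead of retrying the transfer.
            print("attempt %d: missing %s, resubmitting" % (attempt, missing))
        return False

The point being that a missing or zero-size output counts as a job failure,
so a later -resume would also reconsider that job.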

On 2/28/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
> Do we need to file a bug to improve the processing of the missing-file case?
>
> I.e., if the file is truly missing, this should (typically?) not cause great
> delays; the workflow should either proceed or fail quickly.
>
> - Mike
>
> Mihael Hategan wrote, On 2/28/2007 12:31 PM:
> > On Wed, 2007-02-28 at 12:14 -0600, Tiberiu Stef-Praun wrote:
> >> Here is more info:
> >> Indeed, yesterday I got 175 successful jobs out of the total of 192, and
> >> the workflow never finished
> >
> > Did the workflow lock up or did you interrupt it because you got tired
> > of it trying to transfer all the missing files?
> >
> >> (it kept retrying the transfer of files from
> >> the failed ones, which failed because the files did not exist).
> >> Looking at the processor load and at the transfer load, the total of
> >> 175 jobs was done in about 75 minutes (about a 10x speedup over a
> >> serialized execution).
> >>
> >> At Mihael's suggestion I started with smaller workflows, so here are
> >> the numbers (for the ones that completed successfully):
> >> 1 job: 4 minutes
> >> 6 jobs: 6 minutes
> >> 24 jobs: 20 minutes
> >> 36 jobs: 25 minutes (10 minutes execution + 15 minutes data transfer).
> >>
> >> I have a total of 192 jobs to run.
> >>
> >>
> >> I have retried running some of the failed workflows, and they fail
> >> because some task in the workflow does not run correctly. For instance,
> >> the most troubling case was the latest run: the submitted jobs failed
> >> right at the beginning, even though they had run successfully in the
> >> previous run.
> >> My current assumption is that one (or several?) cluster nodes are bad.
> >> The failure shows up in the log as follows: a job gets
> >> submitted, and 20 seconds later GRAM declares it finished (normal
> >> execution time is about 3 minutes), so the workflow attempts to
> >> transfer back nonexistent files (nothing gets generated: no
> >> outputs, and no stdout, stderr, or kickstart record in the job's
> >> working directory), and it creates zero-size files on the submission
> >> machine. That is not good, because when attempting a -resume, those
> >> failed jobs are not reconsidered for execution.
> >>
> >> Summary/Speculation: a bad Teraport node causes the job to be declared
> >> done even though the execution failed.
> >>
> >> I will move to another Grid site, run there locally, and
> >> hopefully not see the same behavior as on Teraport.
> >>
> >> Tibi
> >>
> >> On 2/28/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
> >>> Mihael informs me that the latest problems with the wavlet workflow indicate
> >>> that some number of jobs in the workflow are failing to launch under PBS
> >>> through the pre-WS GRAM provider.  These failing jobs seem to give no
> >>> indication whatsoever of where the underlying failure is occurring.
> >>>
> >>> I think Tibi indicated yesterday that about 25 of 200 parallel jobs
> >>> are failing in this manner (not sure I have these numbers right).
> >>>
> >>> Mihael is continuing to experiment to characterize the failure better and will
> >>> report back to the group (and involve the TP and GRAM support teams) when he
> >>> knows more.
> >>>
> >>> - Mike
> >>>
> >>> --
> >>> Mike Wilde
> >>> Computation Institute, University of Chicago
> >>> Math & Computer Science Division
> >>> Argonne National Laboratory
> >>> Argonne, IL   60439    USA
> >>> tel 630-252-7497 fax 630-252-1997
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>
> >>
> >
> >
>
> --
> Mike Wilde
> Computation Institute, University of Chicago
> Math & Computer Science Division
> Argonne National Laboratory
> Argonne, IL   60439    USA
> tel 630-252-7497 fax 630-252-1997
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


