[Swift-devel] Update on Teraport problems with wavelet workflow

Mihael Hategan hategan at mcs.anl.gov
Wed Feb 28 12:40:49 CST 2007


On Wed, 2007-02-28 at 12:39 -0600, Tiberiu Stef-Praun wrote:
> I would say that the solution is: on a missing file, fully resubmit
> the job, not just try to re-read it.

That does happen, but it also tries to re-read the file first. A filter
could be added so that transfers which fail because the file is missing
are not restarted.
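
A minimal sketch of what such a filter might look like, in Java (since the
underlying CoG code is Java). The class and method names below are
hypothetical, not an existing Swift/CoG API; the real hook would go wherever
the transfer retry decision is currently made:

// Hypothetical sketch only -- these types do not exist in Swift/CoG today.
// Idea: before re-queuing a failed stage-out transfer, inspect the failure
// cause; if the remote file simply does not exist, skip the transfer retry
// and let the job-level restart logic handle it instead.
public class MissingFileRetryFilter {

    /** Result of a failed transfer attempt (hypothetical type). */
    public static class TransferFailure {
        private final String path;
        private final Exception cause;

        public TransferFailure(String path, Exception cause) {
            this.path = path;
            this.cause = cause;
        }

        public String getPath() { return path; }
        public Exception getCause() { return cause; }
    }

    /**
     * Returns true if the transfer should be retried, false if the failure
     * is a missing source file and retrying is pointless.
     */
    public boolean shouldRetry(TransferFailure failure) {
        String msg = failure.getCause() == null
                ? "" : String.valueOf(failure.getCause().getMessage());
        // GridFTP servers typically report a missing file as "No such file
        // or directory" (assumption: exact wording varies by server).
        boolean missing = msg.contains("No such file")
                || msg.contains("does not exist");
        return !missing;
    }
}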

> Anyway, I'm trying to discover what causes the file not to be generated.
> 
> On 2/28/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
> > Do we need to file a bug to improve the processing of the missing-file case?
> >
> > I.e., if the file is truly missing, this should (typically?) not greatly
> > delay the workflow from either proceeding or failing quickly.
> >
> > - Mike
> >
> > Mihael Hategan wrote, On 2/28/2007 12:31 PM:
> > > On Wed, 2007-02-28 at 12:14 -0600, Tiberiu Stef-Praun wrote:
> > >> Here is more info:
> > >> Indeed, yesterday I got 175 successful jobs out of a total of 192, and
> > >> the workflow never ended
> > >
> > > Did the workflow lock up or did you interrupt it because you got tired
> > > of it trying to transfer all the missing files?
> > >
> > >> (it kept retrying the transfer of files from
> > >> the failed jobs, which failed because the files did not exist).
> > >> Looking at the processor load and at the transfer load, the
> > >> 175 jobs were done in about 75 minutes (roughly a 10x speedup over
> > >> a serialized execution).
> > >>
> > >> At Mihael's suggestion I started with smaller workflows, so here are
> > >> the numbers (for the ones that completed successfully):
> > >> 1 job: 4 minutes
> > >> 6 jobs: 6 minutes
> > >> 24 jobs: 20 minutes
> > >> 36 jobs: 25 minutes (10 minutes execution + 15 minutes data transfer).
> > >>
> > >> I have a total of 192 jobs to run.
> > >>
> > >>
> > >> I have retried running some of the failed workflows, and they fail
> > >> because some task in the workflow is not run correctly. For instance,
> > >> the most troubling case was the latest run: the submitted jobs failed
> > >> right at the beginning, even though they had run successfully in the
> > >> previous run.
> > >> My current assumption is that one (or several) cluster nodes are bad.
> > >> The failure can be observed in the log in the following way: a job gets
> > >> submitted, and 20 seconds later GRAM declares it finished (normal
> > >> execution time is about 3 minutes), so the workflow attempts to
> > >> transfer back nonexistent files (nothing gets generated: no outputs,
> > >> and no stdout, stderr, or kickstart records in the job's working
> > >> directory), and it creates zero-size files on the submission machine.
> > >> That is not good, because when attempting a -resume, those failed jobs
> > >> are not reconsidered for execution.
> > >>
> > >> Summary/speculation: a bad Teraport node causes a job to be declared
> > >> done even though the execution failed.
> > >>
> > >> I will move to another grid site, run there locally, and hopefully
> > >> not see the same behavior as on Teraport.
> > >>
> > >> Tibi
> > >>
> > >> On 2/28/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
> > >>> Mihael informs me that the latest problems with the wavelet workflow
> > >>> indicate that some number of jobs in the workflow are failing to launch
> > >>> under PBS through the pre-WS GRAM provider.  These failing jobs seem to
> > >>> give no indication whatsoever of where the underlying failure is occurring.
> > >>>
> > >>> I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs
> > >>> are failing in this manner (not sure I have these numbers right).
> > >>>
> > >>> Mihael is continuing to experiment to characterize the failure better and will
> > >>> report back to the group (and involve the TP and GRAM support teams) when he
> > >>> knows more.
> > >>>
> > >>> - Mike
> > >>>
> > >>> --
> > >>> Mike Wilde
> > >>> Computation Institute, University of Chicago
> > >>> Math & Computer Science Division
> > >>> Argonne National Laboratory
> > >>> Argonne, IL   60439    USA
> > >>> tel 630-252-7497 fax 630-252-1997
> > >>> _______________________________________________
> > >>> Swift-devel mailing list
> > >>> Swift-devel at ci.uchicago.edu
> > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > >>>
> > >>
> > >
> > >
> >
> > --
> > Mike Wilde
> > Computation Institute, University of Chicago
> > Math & Computer Science Division
> > Argonne National Laboratory
> > Argonne, IL   60439    USA
> > tel 630-252-7497 fax 630-252-1997
> >
> 
> 
> -- 
> Tiberiu (Tibi) Stef-Praun, PhD
> Research Staff, Computation Institute
> 5640 S. Ellis Ave, #405
> University of Chicago
> http://www-unix.mcs.anl.gov/~tiberius/
> 



