[Swift-devel] Update on Teraport problems with wavelet workflow

Tiberiu Stef-Praun tiberius at ci.uchicago.edu
Wed Feb 28 12:14:19 CST 2007


Here is more info:
Indeed, yesterday I got 175 successful jobs out of the total of 192, and
the workflow never ended (it kept retrying the transfer of files from
the failed jobs, which failed because the files did not exist).
Looking at the processor load and at the transfer load, the 175 jobs
were done in about 75 minutes (about a 10x speedup over a serialized
execution).

At Mihael's suggestion I started with smaller workflows, so here are
the numbers (for the ones that completed successfully):
1 job: 4 minutes
6 jobs: 6 minutes
24 jobs: 20 minutes
36 jobs: 25 minutes (10 minutes execution + 15 minutes data transfer).

I have a total of 192 jobs to run.
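
For reference, here is the back-of-the-envelope arithmetic behind the
~10x figure above (just a sketch; it assumes each job takes roughly the
4 minutes observed in the single-job run):

    # Rough speedup estimate; the ~4 min/job figure is taken from the
    # 1-job run above and is only an approximation.
    jobs_completed = 175
    minutes_per_job = 4        # assumed per-job time
    parallel_minutes = 75      # observed wall-clock time for the 175 jobs

    serial_minutes = jobs_completed * minutes_per_job      # ~700 minutes
    speedup = float(serial_minutes) / parallel_minutes     # ~9.3, i.e. roughly 10x
    print("serial estimate: %d min, speedup: %.1fx" % (serial_minutes, speedup))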


I have retried running some of the failed workflows, and they fail
because some task in the workflow does not run correctly. For instance,
the most troubling case was the latest run: the submitted jobs failed
right at the beginning, even though they had run successfully in the
previous run.
My current assumption is that one (or possibly several) cluster nodes
are bad.
The failure can be observed in the log in the following way: a job gets
submitted, and 20 seconds later GRAM declares it finished (normal
execution time is about 3 minutes), so the workflow attempts to
transfer back nonexistent files (nothing gets generated: no outputs, no
stdout/stderr, and no kickstart record in the job's working directory),
and this creates zero-size files on the submission machine. That is not
good, because when attempting a -resume, those failed jobs are not
reconsidered for execution.
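
As a possible workaround (only a sketch: the output directory path is
hypothetical, and I have not verified that removing the zero-size files
is enough for -resume to reschedule those jobs), something like this
could clear the bogus zero-size transfer results on the submission
machine before retrying:

    # Sketch: delete zero-size output files on the submission machine so
    # that a subsequent -resume might reconsider the corresponding jobs.
    import os

    RUN_DIR = "/path/to/run/output"   # hypothetical; use the real run directory

    for root, dirs, files in os.walk(RUN_DIR):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                print("removing zero-size file: " + path)
                os.remove(path)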

Summary/speculation: a bad Teraport node causes jobs to be declared
done even though the execution failed.

I will move to another Grid site, run there locally, and hopefully not
get the same behavior as on Teraport.

Tibi

On 2/28/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
> Mihael informs me that the latest problems with the wavelet workflow indicate
> that some number of jobs in the workflow are failing to launch under PBS
> through the pre-WS GRAM provider.  These failing jobs seem to give no
> indication whatsoever of where the underlying failure is occurring.
>
> I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs
> are failing in this manner (not sure I have these numbers right).
>
> Mihael is continuing to experiment to characterize the failure better and will
> report back to the group (and involve the TP and GRAM support teams) when he
> knows more.
>
> - Mike
>
> --
> Mike Wilde
> Computation Institute, University of Chicago
> Math & Computer Science Division
> Argonne National Laboratory
> Argonne, IL   60439    USA
> tel 630-252-7497 fax 630-252-1997
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


