[Swift-devel] Update on Teraport problems with wavlet workflow

Mike Wilde wilde at mcs.anl.gov
Wed Feb 28 12:37:31 CST 2007


Do we need to file a bug to improve the processing of the missing-file case?

I.e., if the file is truly missing, this should (typically?) not cause great 
delays; the workflow should either proceed or fail quickly.
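
To make the intended behavior concrete, here is a rough sketch of the kind of
retry policy I mean (a sketch only; every name below is a placeholder, not
something from the actual Swift/Karajan code base):

# Sketch of a fail-fast retry policy for file staging.  All names here are
# illustrative placeholders, not the real Swift/Karajan implementation.

class MissingRemoteFileError(Exception):
    """Permanent error: the remote file really does not exist."""

class TransientTransferError(Exception):
    """Temporary error: timeout, busy server, dropped connection, ..."""

def stage_out(transfer, max_retries=3):
    """Retry transient failures, but give up immediately on a missing file."""
    for attempt in range(max_retries):
        try:
            return transfer()
        except MissingRemoteFileError:
            # Retrying cannot help; report the error now so the workflow can
            # proceed (or fail) quickly instead of stalling.
            raise
        except TransientTransferError:
            continue  # retry up to max_retries times
    raise TransientTransferError("giving up after %d attempts" % max_retries)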

- Mike

Mihael Hategan wrote, On 2/28/2007 12:31 PM:
> On Wed, 2007-02-28 at 12:14 -0600, Tiberiu Stef-Praun wrote:
>> Here is more info:
>> Indeed, yesterday I got 175 successful jobs out of the total of 192, and
>> the workflow never ended 
> 
> Did the workflow lock up or did you interrupt it because you got tired
> of it trying to transfer all the missing files?
> 
>> (it kept retrying the transfer of files from the failed ones, which
>> failed because the files did not exist).
>> Looking at the processor load and at the transfer load, the total of
>> 175 jobs was done in about 75 minutes (about a 10x speedup over a
>> serialized execution).
>>
>> At Mihael's suggestion I started with smaller workflows, so here are
>> the numbers (for the ones that completed successfully):
>> 1 job: 4 minutes
>> 6 jobs: 6 minutes
>> 24 jobs: 20 minutes
>> 36 jobs: 25 minutes (10 minutes execution + 15 minutes data transfer).
>>
>> I have a total of 192 jobs to run.
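
As a quick sanity check on the timings above (assuming roughly 4 minutes of
serialized work per job, taken from the 1-job run):

# Back-of-the-envelope check of the reported speedup.  The per-job serial
# time is an assumption based on the 1-job run above (~4 minutes).
jobs_done = 175
parallel_minutes = 75.0
serial_minutes_per_job = 4.0   # assumed

serial_total = jobs_done * serial_minutes_per_job    # 700 minutes
speedup = serial_total / parallel_minutes            # ~9.3
print("estimated serialized time: %.0f min -> speedup ~%.1fx"
      % (serial_total, speedup))

That works out to roughly 9x, consistent with the "about 10x" estimate.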
>>
>>
>> I have retried running some of the failed workflows, and they fail
>> because some task in the workflow is not run correctly. For instance,
>> the most troubling one was the latest run: the jobs submitted failed
>> right at the beginning, even though they had run successfully in the
>> previous run.
>> My current assumption is that one (or several?) cluster nodes are bad.
>> The failure can be observed in the log in the following way: a job gets
>> submitted, and 20 seconds later GRAM declares it finished (normal
>> execution time is about 3 minutes), so the workflow attempts to
>> transfer back nonexistent files (nothing gets generated in the job's
>> working directory: no outputs, no stdout/stderr, no kickstart record),
>> and it creates zero-size files on the submission machine. That is not
>> good, because when attempting a -resume, those failed jobs are not
>> reconsidered for execution.
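
For the record, the kind of post-job sanity check that would catch this case
looks something like the following (a sketch only; the file names and the
check itself are hypothetical, not current Swift behavior):

import os

def job_really_succeeded(workdir, expected_outputs):
    """Return False if the job left no usable evidence of having run.

    A missing or zero-size output, stdout/stderr, or kickstart record is
    treated as a failure, so that a later -resume will re-run the job.
    (File names here are illustrative, not the actual Swift layout.)
    """
    for name in list(expected_outputs) + ["stdout", "stderr", "kickstart.xml"]:
        path = os.path.join(workdir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            return False
    return True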
>>
>> Summary/speculation: a bad Teraport node causes a job to be declared
>> done even though the execution failed.
>>
>> I will move to another Grid site, run there locally, and hopefully
>> not get the same behavior as on Teraport.
>>
>> Tibi
>>
>> On 2/28/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
>>> Mihael informs me that the latest problems with the wavlet workflow indicate
>>> that some number of jobs in the workflow are failing to launch under PBS
>>> through the pre-WS GRAM provider.  These failing jobs seem to give no
>>> indication whatsoever of where the underlying failure is occurring.
>>>
>>> I think Tibi indicated yesterday that about 25 jobs out of 200 parallel jobs
>>> are failing in this manner (not sure I have these numbers right).
>>>
>>> Mihael is continuing to experiment to characterize the failure better and will
>>> report back to the group (and involve the TP and GRAM support teams) when he
>>> knows more.
>>>
>>> - Mike
>>>
>>> --
>>> Mike Wilde
>>> Computation Institute, University of Chicago
>>> Math & Computer Science Division
>>> Argonne National Laboratory
>>> Argonne, IL   60439    USA
>>> tel 630-252-7497 fax 630-252-1997
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


