[Swift-devel] Re: swift-falkon problem
Michael Wilde
wilde at mcs.anl.gov
Tue Mar 18 17:12:12 CDT 2008
I will rerun with log4j settings.
Will also try adding the sleep suggested earlier - to see if all jobs
then fail.
I did re-run the workflow 3X on local, and each time all 100 jobs
finished successfully. Also for this dataset, all jobs return data.
- Mike
On 3/18/08 4:45 PM, Ben Clifford wrote:
> On Tue, 18 Mar 2008, Ioan Raicu wrote:
>
>> Could NFS latency, where one node creates a
>> file/dir and another node requires some time (in this case, 5 sec) before it
>> actually sees the file, explain what Mike is seeing? If this is a likely
>> explanation, then the race condition is that the exit code goes from worker to
>> Falkon service to Swift faster than NFS can update its file/dir list, and when
>> Swift checks for the file or dir (probably within tens of milliseconds of the
>> job completion), it can't find the file/dir. Are there any counterarguments
>> that would make this hypothesis not possible? Just another hypothesis which
>> might be worth investigating.
>>
>
> According to the timing in the log file, Swift is getting a notification
> from provider-deef that the job completed before the actual job has even
> been run to completion on the worker, well before the wrapper even
> attempts to write out a status file.
>
> I'm not accusing this of being a problem inside Falkon - I'm saying I
> think it's happening somewhere below the Swift layer, so it could well be
> provider-deef, which is probably the most neglected part of this whole
> stack.
>
> Mike, are you running with those extra debug lines in the log4j
> configuration? If not, please run again with them turned on. Also Ioan can
> probably recommend which Falkon logs to keep so we can see what's
> happening for a job there and approach the problem from the other end of
> the stack too.
>
>
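
For reference, the NFS-latency hypothesis quoted above could be probed with something like the sketch below: instead of checking for the status file once, right after the completion notification, poll for it until a deadline. This is only an illustration, not code from Swift or Falkon; the function name wait_for_file, the timeout, and the polling interval are all assumptions. It would only help if NFS visibility lag is the cause; it does not address a completion notification that arrives before the job has actually finished.

```python
import os
import time


def wait_for_file(path, timeout=10.0, interval=0.5):
    """Poll until `path` is visible or `timeout` seconds elapse.

    Hypothetical helper: returns True if the file became visible
    within the timeout, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        # On NFS, a file created on another node may not be visible yet.
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

If the failures disappear with a timeout a little longer than the suspected 5 sec lag, that would support the NFS explanation; if they persist, the early notification is the more likely culprit.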