[Swift-devel] Re: swift-falkon problem

Tue Mar 18 16:45:56 CDT 2008

On Tue, 18 Mar 2008, Ioan Raicu wrote:

> Could a latency of NFS in which one node creates a
> file/dir and another node requires xxx time (in this case, 5 sec) before it
> actually sees the file, explain what Mike is seeing?  If this is a likely
> explanation, then the race condition is that the exit code goes from worker to
> Falkon service to Swift faster than NFS can update its file/dir list, and when
> Swift checks for the file or dir (probably within tens of milliseconds of the
> job completion), it can't find the file/dir.  Are there any counterarguments
> that would make this hypothesis not possible?  Just another hypothesis which
> might be worth investigating.
> 

According to the timing in the log file, Swift is getting a notification 
from provider-deef that the job completed before the actual job has even 
been run to completion on the worker, well before the wrapper even 
attempts to write out a status file.
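
To make the distinction between the two hypotheses concrete, here is a 
minimal sketch in Python. It is not taken from Swift, Falkon or 
provider-deef; the function name, the labels and the three timestamps are 
all hypothetical, and it assumes the clocks on the hosts involved are 
close enough to compare.

```python
from datetime import datetime, timedelta


def classify(notified_at, job_finished_at, status_visible_at):
    """Say which explanation three timestamps are consistent with.

    All three arguments are hypothetical values read from logs:
      notified_at        when Swift was told the job had completed
      job_finished_at    when the job actually finished on the worker
      status_visible_at  when the status file became visible to Swift
    """
    if notified_at < job_finished_at:
        # Swift was notified before the job ended, so no amount of
        # filesystem delay can account for the missing file.
        return "premature-notification"
    if notified_at < status_visible_at:
        # The job had ended, but the file was not yet visible to Swift.
        return "nfs-latency"
    return "consistent"


if __name__ == "__main__":
    t0 = datetime(2008, 3, 18, 16, 0, 0)
    print(classify(t0, t0 + timedelta(seconds=30), t0 + timedelta(seconds=35)))
```

The ordering in the log file corresponds to the first branch, which is 
why NFS latency on its own does not look like a sufficient explanation.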

I'm not claiming this is a problem inside Falkon. I'm saying I think 
it's happening somewhere below the Swift layer, so it could well be in 
provider-deef, which is probably the most neglected part of this whole 
stack.

Mike, are you running with those extra debug lines in the log4j 
configuration? If not, please run again with them turned on. Also, Ioan 
can probably recommend which Falkon logs to keep, so that we can see 
what's happening to a job there and approach the problem from the other 
end of the stack too.

--