[Swift-devel] Re: swift-falkon problem

Michael Wilde wilde at mcs.anl.gov
Tue Mar 18 17:12:12 CDT 2008


I will rerun with log4j settings.
Will also try adding the sleep suggested earlier - to see if all jobs 
then fail.

I did re-run the workflow 3X on local, and each time all 100 jobs 
finished successfully. Also for this dataset, all jobs return data.

- Mike

On 3/18/08 4:45 PM, Ben Clifford wrote:
> On Tue, 18 Mar 2008, Ioan Raicu wrote:
> 
>> Could NFS latency, in which one node creates a file/dir and another node
>> needs some time (in this case, 5 sec) before it actually sees the file,
>> explain what Mike is seeing?  If this is a likely explanation, then the race
>> condition is that the exit code goes from worker to Falkon service to Swift
>> faster than NFS can update its file/dir list, and when Swift checks for the
>> file or dir (probably within tens of milliseconds of the job completion), it
>> can't find the file/dir.  Are there any counterarguments that would rule
>> this hypothesis out?  Just another hypothesis which might be worth
>> investigating.
>>
> 
> According to the timing in the log file, Swift is getting a notification 
> from provider-deef that the job completed before the actual job has even 
> been run to completion on the worker, well before the wrapper even 
> attempts to write out a status file.
> 
> I'm not accusing this of being a problem inside Falkon - I'm saying I 
> think it's happening somewhere below the Swift layer, so it could well be 
> provider-deef, which is probably the most neglected part of this whole 
> stack.
> 
> Mike, are you running with those extra debug lines in the log4j 
> configuration? If not, please run again with them turned on. Also Ioan can 
> probably recommend which Falkon logs to keep so we can see what's 
> happening for a job there and approach the problem from the other end of 
> the stack too.
> 
> 
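One way to tell the two hypotheses in the quoted thread apart is to replace a single existence check with a bounded poll and record how long the file took to appear. The following is only a minimal sketch in Python, not code from Swift, Falkon or provider-deef; the function name wait_for_path and its timeout and interval parameters are hypothetical. If the status file shows up after a short wait, that would be consistent with NFS visibility latency; if it never shows up within the timeout, or the job is still running when the completion notification arrives, that points at the early notification instead.

```python
import os
import time


def wait_for_path(path, timeout=10.0, interval=0.5):
    """Poll until `path` exists or `timeout` seconds have elapsed.

    Returns the number of seconds waited if the path became visible,
    or None if it was still missing when the timeout expired.
    (Hypothetical helper for illustration only.)
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        if os.path.exists(path):
            # The path is visible from this node; report how long it took.
            return time.monotonic() - start
        if time.monotonic() >= deadline:
            # Still not visible after the full timeout.
            return None
        time.sleep(interval)
```

Logging the returned wait time per job would show whether the delay clusters around a few seconds, as a latency explanation would predict, or whether files are simply absent at notification time.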


