[Swift-devel] Re: swift-falkon problem

Ioan Raicu iraicu at cs.uchicago.edu
Tue Mar 18 17:20:26 CDT 2008


The clocks on the two machines that Mike was running on seem to be in 
sync (less than 1 sec off).

iraicu@bblogin:~/java/svn/falkon$ date
Tue Mar 18 17:10:15 CDT 2008

iraicu@scx-m23n6 ~/java/svn/falkon/worker/temp $ date
Tue Mar 18 17:10:15 CDT 2008
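
For repeating this check on other node pairs, here is a minimal Python sketch that computes the difference between two `date` outputs. The function names are hypothetical, and it assumes both machines report the same time zone. It only has one-second resolution, and it says nothing about how far apart the two commands were actually typed, so treat the result as a rough upper bound rather than a real skew measurement.

```python
from datetime import datetime


def parse_date_output(line):
    """Parse a line such as 'Tue Mar 18 17:10:15 CDT 2008'.

    The zone abbreviation is split off and returned separately because
    strptime's %Z does not reliably understand names like 'CDT'.
    """
    parts = line.split()
    zone = parts[4]
    stamp = " ".join(parts[:4] + parts[5:])
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"), zone


def clock_skew_seconds(first, second):
    """Return the absolute difference in seconds between two date outputs."""
    t1, z1 = parse_date_output(first)
    t2, z2 = parse_date_output(second)
    if z1 != z2:
        # Assumption: both hosts use the same zone; otherwise compare in UTC.
        raise ValueError("time zones differ; compare `date -u` output instead")
    return abs((t1 - t2).total_seconds())


if __name__ == "__main__":
    login_node = "Tue Mar 18 17:10:15 CDT 2008"
    worker_node = "Tue Mar 18 17:10:15 CDT 2008"
    print(clock_skew_seconds(login_node, worker_node))
```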

Mike, these are the logs you need to capture when running in debug 
mode:
iraicu@viper:~/java/svn/falkon/config> cat Falkon-TCPCore.config
GenericPortalWS=falkon_task_submission_history.txt
GenericPortalWS_perf_per_sec=falkon_summary.txt
GenericPortalWS_taskPerf=falkon_task_perf.txt
GenericPortalWS_task=falkon_task_status.txt

When running in normal mode (when we know things work fine), we just need:
iraicu@viper:~/java/svn/falkon/config> cat Falkon-TCPCore.config
GenericPortalWS_perf_per_sec=falkon_summary.txt
GenericPortalWS_taskPerf=falkon_task_perf.txt
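
To double-check before a run that the config really has all four debug logs turned on, something like the following Python sketch might help. It assumes the file is plain key=value lines, as the `cat` output above suggests; the real Falkon-TCPCore.config may hold other settings that this ignores. The function names and the key sets are my own grouping of the entries listed above, not anything Falkon defines.

```python
# Log keys taken from the two config listings above.
DEBUG_KEYS = {
    "GenericPortalWS",
    "GenericPortalWS_perf_per_sec",
    "GenericPortalWS_taskPerf",
    "GenericPortalWS_task",
}
NORMAL_KEYS = {
    "GenericPortalWS_perf_per_sec",
    "GenericPortalWS_taskPerf",
}


def parse_config(text):
    """Parse key=value lines into a dict, skipping blank or malformed lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_logs(settings, required):
    """Return the required log keys that are absent or have an empty value."""
    return sorted(key for key in required if not settings.get(key))


if __name__ == "__main__":
    normal_mode = (
        "GenericPortalWS_perf_per_sec=falkon_summary.txt\n"
        "GenericPortalWS_taskPerf=falkon_task_perf.txt\n"
    )
    # Lists the entries still needed before a debug run.
    print(missing_logs(parse_config(normal_mode), DEBUG_KEYS))
```

In practice you would read the text from config/Falkon-TCPCore.config instead of the inline string.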

If we can't figure things out from the Swift and Falkon service logs, 
we might have to enable worker-side logs as well, which you do from 
the run.worker-c.sh (or run.worker-c-ram.sh) script.

It's also possible that the Falkon provider code is doing something 
funny, but I'd want to see the Falkon logs before we focus on the provider.

Ioan


Ben Clifford wrote:
> On Tue, 18 Mar 2008, Ioan Raicu wrote:
>
>   
>> Could a latency of NFS in which one node creates a
>> file/dir and another node requires xxx time (in this case, 5 sec) before it
>> actually sees the file, explain what Mike is seeing?  If this is a likely
>> explanation, then the race condition is that the exit code goes from worker to
>> Falkon service to Swift faster than NFS can update its file/dir list, and when
>> Swift checks for the file or dir (probably within tens of milliseconds of the
>> job completion), it can't find the file/dir.  Are there any counterarguments
>> that would make this hypothesis not possible?  Just another hypothesis which
>> might be worth investigating.
>>
>>     
>
> According to the timing in the log file, Swift is getting a notification 
> from provider-deef that the job completed before the actual job has even 
> been run to completion on the worker, well before the wrapper even 
> attempts to write out a status file.
>
> I'm not accusing this of being a problem inside Falkon - I'm saying I 
> think it's happening somewhere below the Swift layer, so it could well be 
> provider-deef, which is probably the most neglected part of this whole 
> stack.
>
> Mike, are you running with those extra debug lines in the log4j 
> configuration? If not, please run again with them turned on. Also Ioan can 
> probably recommend which Falkon logs to keep so we can see what's 
> happening for a job there and approach the problem from the other end of 
> the stack too.
>
>
>   
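
If the NFS visibility hypothesis quoted above still needs to be ruled out directly, one rough way is to poll for the file for a few seconds after the job is reported done and record how long it takes to appear. This is only a sketch: `wait_for_path`, the timeout values, and the example path are all hypothetical, and this is not how Swift itself checks for the status file.

```python
import os
import time


def wait_for_path(path, timeout=10.0, interval=0.1):
    """Poll until `path` exists.

    Returns the number of seconds waited, or None if the path did not
    appear within `timeout` seconds.
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        if os.path.exists(path):
            return time.monotonic() - start
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


if __name__ == "__main__":
    # Hypothetical path; substitute the status file Swift reports as missing.
    waited = wait_for_path("/tmp/example-status-file", timeout=5.0)
    if waited is None:
        print("file never appeared within the timeout")
    else:
        print("file visible after %.2f s" % waited)
```

A file that shows up only after a noticeable delay would support the NFS explanation; a file that never shows up within the timeout would point back at the notification arriving too early.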

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================