[Swift-devel] Re: swift-falkon problem

Michael Wilde wilde at mcs.anl.gov
Sun Mar 23 21:59:04 CDT 2008


Ben, thanks.

I've been debugging this since Friday. I had already moved the sync 
into wrapper.sh when Mihael first mentioned it.

Friday afternoon I moved from a falkon binary drop that Ioan had built 
for me, to a build that I got from SVN and built myself.

When I did that the nature of the problem changed:
- first run after a falkon restart, with the sync in wrapper.sh, worked
fine, at various workflow sizes.
- second run would consistently fail with most jobs missing output, 
status and info files.

Turns out the data was going mostly into the previous workflow's workdir.

After much debugging, the problem was found to be bad message formatting 
in the falkon service, causing the chdir to the workdir to fail. It 
failed very seldom on the initial workflow, and heavily on subsequent 
ones. This problem, too, initially looked like NFS incoherence.
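
For illustration, here is a minimal sketch of how a worker-side script could guard against this failure mode: it refuses to run the job when the chdir fails, rather than carrying on in whatever directory the previous job left behind. The function name run_in_workdir is hypothetical; this is not the actual Falkon or wrapper.sh code.

```shell
#!/bin/sh
# Hypothetical helper (not from Falkon or Swift): run a command inside a
# job's work directory, and refuse to run it at all if the chdir fails.
run_in_workdir() {
    workdir=$1
    shift
    (
        # A subshell keeps the caller's working directory untouched.
        cd "$workdir" || {
            echo "cannot chdir to $workdir, not running job" >&2
            exit 1
        }
        "$@"
    )
}
```

Called as `run_in_workdir "$dir" ./app args`, this returns non-zero without starting the application when the directory cannot be entered, so output cannot end up in an earlier workflow's workdir.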

Since that was fixed, I've been experimenting with workflows of various 
sizes, and have run several 10, 25, 100, 500, and 1000 job workflows, 
all without any sync, and without apparent problems.

Some mysteries remain, as it's not clear that this message/chdir fix 
explains the earlier problem. But several Falkon fixes went in as well, 
so there are too many variables to know with confidence whether the 
original problem remains.

Ioan: I do see that we're losing some workers, so some investigation is 
needed on the Falkon side.

Ben: the swift provenance log records seem excessive: I'll start a 
thread on that.

I'm going to start performance measurement and tuning on this now 
that things seem stable enough to do repeatable runs.

- Mike


On 3/23/08 7:12 PM, Ben Clifford wrote:
> On Fri, 21 Mar 2008, Mihael Hategan wrote:
> 
>> On Fri, 2008-03-21 at 07:12 -0500, Michael Wilde wrote:
>>> My latest test on runs of 25, 100, and 1000 jobs seem to indicate that
>>> with a sync command at the end of the application script, all job status
>>> and data is returned ok every time.
>> Why not put it in the wrapper script at the end?
> 
> Mike, the attached patch will do that, and will also add logging 
> information so that we can see how long syncs are taking compared to other 
> stages in worker node execution.
> 
> cd cog/modules/vdsk
> patch -p1 < sync-in-wrapper
> 
> 
