[Swift-devel] Error in syncing job start with input file availability?

Ben Clifford benc at hawaga.org.uk
Sun Nov 4 00:18:29 CDT 2007



On Sat, 3 Nov 2007, Michael Wilde wrote:

> Q: Do these email messages indicate that the job was failed by PBS before the
> app was started, or do these messages indicate a non-zero app exit, e.g., if its
> input file was missing?

I don't know what the different PBS errors mean.

That job never finished as far as Swift is concerned (or at least Swift 
exited before logging anything). Perhaps you're running with lazy errors 
turned off, which is the default at the moment; I am undecided whether off 
or on is the better default.

> -rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:52.000000000 -0500
> pc1.pcap

> 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START
> jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\

> But the main suspicious thing above is that while the log shows staging
> complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod date to
> be 4:55 past the hour, while the job was started (queued?) at 4:52.

The mod date is 4:52, not 4:55 (see the ls output above).
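
The two timestamps carry different UTC offsets (-0500 in the ls output,
-0600 in the log), so they are easiest to compare after parsing them as
offset-aware times. A minimal sketch in Python, using only the values
quoted above; the variable names are illustrative:

```python
from datetime import datetime

# Modification time from the ls output. The nanosecond field (all zeros)
# is dropped because strptime's %f accepts at most six digits.
ls_mtime = datetime.strptime("2007-11-03 20:04:52 -0500",
                             "%Y-%m-%d %H:%M:%S %z")

# JOB_START timestamp exactly as it appears in the log line.
job_start = datetime.strptime("2007-11-03 19:04:52,401-0600",
                              "%Y-%m-%d %H:%M:%S,%f%z")

# Offset-aware datetimes subtract correctly across UTC offsets.
delta = (job_start - ls_mtime).total_seconds()
print(f"JOB_START is {delta:.3f} s after the file mod time")
```

Both values fall in the same second, so the timestamps alone do not show
the job starting before the file was written on the submit side. They say
nothing about when the file became visible on the worker node.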

> If the job happened to hit the PBS queue right at the time PBS was doing a
> queue poll, it may have started right away, and somehow started before file
> pc1.pcap was visible to the worker node.  I'm not sure what if anything in the
> synchronization prevents this, especially if NFS close-to-open consistency is
> broken. (Which we are very suspicious of on this site and with Linux NFS in
> general).

What site? Can you use a different FS?
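
If a different FS is not an option, one possible workaround is for the
job to poll for its input before starting the app. The sketch below is
only an illustration of that idea, not something Swift does: the
`wait_for_file` helper and its parameters are hypothetical, and on NFS
the client's attribute cache can still return stale results, so this
narrows the window rather than closing it.

```python
import os
import time

def wait_for_file(path, expected_size=None, timeout=30.0, interval=0.5):
    """Poll until path exists (and has expected_size, if given).

    Hypothetical helper: returns True once the file is visible,
    False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            size = os.stat(path).st_size
            if expected_size is None or size == expected_size:
                return True
        except FileNotFoundError:
            pass  # not visible on this node yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For the file above, a call such as `wait_for_file("pc1.pcap", 46747037)`
would check both that the file is visible and that its size matches the
size shown by ls.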
