[Swift-devel] Error in syncing job start with input file availability?

Sat Nov 3 23:34:02 CDT 2007

In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like 
theres a chance that a job attempted to start before its data was 
visible to the node (not sure, just suspicious).

Its a 5-job angle run.  4 jobs worked. The 5th job failed, the one for 
index 1 of a 5-element input file array.

The wf ran with ~wilde/swiftdata/* as the storage and work dir (2 
subdirs below that). So it was on NFS.

All 5 input files are in the shared/ dir, but the failing job is the one 
whose timestamp is last. (0, 2,3,4 worked; 1 failed)

I also got 3 emails from PBS of the form:

PBS Job Id: 1571647.tg-master.uc.teragrid.org
Job Name:   STDIN
Aborted by PBS Server
Job cannot be executed
See Administrator for help

all dated 8:05 PM, three consecutive job ids, *47, 48, 49.

Q: Do these email messages indicate that the job was failed by PBS 
before the app was started, or do these messages indicate a non-zero app 
exit, eg, if its input file was missing?

The input files on shared/ were dated:

drwxr-xr-x    4 wilde    allocate      512 2007-11-03 20:04:33.000000000 
-0500 _concurrent/
-rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:52.000000000 
-0500 pc1.pcap
-rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:55.000000000 
-0500 pc2.pcap
-rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:52.000000000 
-0500 pc3.pcap
-rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:47.000000000 
-0500 pc4.pcap
-rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:51.000000000 
-0500 pc5.pcap
-rw-r--r--    1 wilde    allocate      813 2007-11-03 20:04:33.000000000 
-0500 seq.sh
-rw-r--r--    1 wilde    allocate     4848 2007-11-03 20:04:33.000000000 
-0500 wrapper.sh

The awf3*.log file shows:

2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END 
file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\
ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ 
provider=file
2007-11-03 19:04:52,400-0600 INFO  vdl:dostagein END 
jobid=angle4-ujal0lji - Staging in finished
2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START 
jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\
6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, 
_concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] 
tmpdir=awf3\
-20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC

(Note that the logfile for some reason logs times 1 hour behind???)

But the main suspicious thing above is that while the log shows stagin 
complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod 
date to be 4:55 past the hour, while the job was started (queued?) at 4:52.

If the job happened to hit the PBS queue right at the time PBS was doing 
a queue poll, it may have started right away, and somehow started before 
file pc1.pcap was visible to the worker node.  Im not sure what if 
anything in the synchronization prevents this, especially if NFS 
close-to-open consistency is broken. (Which we are very suspicious of on 
this site and with Linux NFS in general).

Lastly, i've run the identical workflow twice more now, and its worked 
with no change both times.

Any ideas or other explanations for what may have happened here?

Also, ideas why the swift log file shows times an hour behind?