[Swift-devel] Error in syncing job start with input file availability?
Michael Wilde
wilde at mcs.anl.gov
Sat Nov 3 23:34:02 CDT 2007
In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like
theres a chance that a job attempted to start before its data was
visible to the node (not sure, just suspicious).
Its a 5-job angle run. 4 jobs worked. The 5th job failed, the one for
index 1 of a 5-element input file array.
The wf ran with ~wilde/swiftdata/* as the storage and work dir (2
subdirs below that). So it was on NFS.
All 5 input files are in the shared/ dir, but the failing job is the one
whose timestamp is last. (0, 2,3,4 worked; 1 failed)
I also got 3 emails from PBS of the form:
PBS Job Id: 1571647.tg-master.uc.teragrid.org
Job Name: STDIN
Aborted by PBS Server
Job cannot be executed
See Administrator for help
all dated 8:05 PM, three consecutive job ids, *47, 48, 49.
Q: Do these email messages indicate that the job was failed by PBS
before the app was started, or do these messages indicate a non-zero app
exit, eg, if its input file was missing?
The input files on shared/ were dated:
drwxr-xr-x 4 wilde allocate 512 2007-11-03 20:04:33.000000000
-0500 _concurrent/
-rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000
-0500 pc1.pcap
-rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:55.000000000
-0500 pc2.pcap
-rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000
-0500 pc3.pcap
-rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:47.000000000
-0500 pc4.pcap
-rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:51.000000000
-0500 pc5.pcap
-rw-r--r-- 1 wilde allocate 813 2007-11-03 20:04:33.000000000
-0500 seq.sh
-rw-r--r-- 1 wilde allocate 4848 2007-11-03 20:04:33.000000000
-0500 wrapper.sh
The awf3*.log file shows:
2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END
file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\
ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/
provider=file
2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END
jobid=angle4-ujal0lji - Staging in finished
2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START
jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\
6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0,
_concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0]
tmpdir=awf3\
-20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC
(Note that the logfile for some reason logs times 1 hour behind???)
But the main suspicious thing above is that while the log shows stagin
complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod
date to be 4:55 past the hour, while the job was started (queued?) at 4:52.
If the job happened to hit the PBS queue right at the time PBS was doing
a queue poll, it may have started right away, and somehow started before
file pc1.pcap was visible to the worker node. Im not sure what if
anything in the synchronization prevents this, especially if NFS
close-to-open consistency is broken. (Which we are very suspicious of on
this site and with Linux NFS in general).
Lastly, i've run the identical workflow twice more now, and its worked
with no change both times.
Any ideas or other explanations for what may have happened here?
Also, ideas why the swift log file shows times an hour behind?
More information about the Swift-devel
mailing list