[Swift-devel] Error in syncing job start with input file availability?

Sat Nov 3 23:59:32 CDT 2007

Looks to me like a problem with PBS rather than something with the jobs.
So I don't think this is worth investigating. It belongs to the "random
bad things happen on occasion" class of problems, for which we have
restarts and scoring.

Mihael

On Sat, 2007-11-03 at 23:34 -0500, Michael Wilde wrote:
> In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like 
> theres a chance that a job attempted to start before its data was 
> visible to the node (not sure, just suspicious).
> 
> Its a 5-job angle run.  4 jobs worked. The 5th job failed, the one for 
> index 1 of a 5-element input file array.
> 
> The wf ran with ~wilde/swiftdata/* as the storage and work dir (2 
> subdirs below that). So it was on NFS.
> 
> All 5 input files are in the shared/ dir, but the failing job is the one 
> whose timestamp is last. (0, 2,3,4 worked; 1 failed)
> 
> I also got 3 emails from PBS of the form:
> 
> PBS Job Id: 1571647.tg-master.uc.teragrid.org
> Job Name:   STDIN
> Aborted by PBS Server
> Job cannot be executed
> See Administrator for help
> 
> all dated 8:05 PM, three consecutive job ids, *47, 48, 49.
> 
> Q: Do these email messages indicate that the job was failed by PBS 
> before the app was started, or do these messages indicate a non-zero app 
> exit, eg, if its input file was missing?
> 
> The input files on shared/ were dated:
> 
> drwxr-xr-x    4 wilde    allocate      512 2007-11-03 20:04:33.000000000 
> -0500 _concurrent/
> -rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:52.000000000 
> -0500 pc1.pcap
> -rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:55.000000000 
> -0500 pc2.pcap
> -rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:52.000000000 
> -0500 pc3.pcap
> -rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:47.000000000 
> -0500 pc4.pcap
> -rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:51.000000000 
> -0500 pc5.pcap
> -rw-r--r--    1 wilde    allocate      813 2007-11-03 20:04:33.000000000 
> -0500 seq.sh
> -rw-r--r--    1 wilde    allocate     4848 2007-11-03 20:04:33.000000000 
> -0500 wrapper.sh
> 
> The awf3*.log file shows:
> 
> 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END 
> file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\
> ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ 
> provider=file
> 2007-11-03 19:04:52,400-0600 INFO  vdl:dostagein END 
> jobid=angle4-ujal0lji - Staging in finished
> 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START 
> jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\
> 6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, 
> _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] 
> tmpdir=awf3\
> -20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC
> 
> (Note that the logfile for some reason logs times 1 hour behind???)
> 
> But the main suspicious thing above is that while the log shows stagin 
> complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod 
> date to be 4:55 past the hour, while the job was started (queued?) at 4:52.
> 
> If the job happened to hit the PBS queue right at the time PBS was doing 
> a queue poll, it may have started right away, and somehow started before 
> file pc1.pcap was visible to the worker node.  Im not sure what if 
> anything in the synchronization prevents this, especially if NFS 
> close-to-open consistency is broken. (Which we are very suspicious of on 
> this site and with Linux NFS in general).
> 
> Lastly, i've run the identical workflow twice more now, and its worked 
> with no change both times.
> 
> Any ideas or other explanations for what may have happened here?
> 
> Also, ideas why the swift log file shows times an hour behind?
> 
> 
> 
> 
> 
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>