[Swift-devel] Error in syncing job start with input file availability?
Mihael Hategan
hategan at mcs.anl.gov
Sat Nov 3 23:59:32 CDT 2007
Looks to me like a problem with PBS rather than something with the jobs.
So I don't think this is worth investigating. It belongs to the "random
bad things happen on occasion" class of problems, for which we have
restarts and scoring.
Mihael
On Sat, 2007-11-03 at 23:34 -0500, Michael Wilde wrote:
> In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like
> theres a chance that a job attempted to start before its data was
> visible to the node (not sure, just suspicious).
>
> Its a 5-job angle run. 4 jobs worked. The 5th job failed, the one for
> index 1 of a 5-element input file array.
>
> The wf ran with ~wilde/swiftdata/* as the storage and work dir (2
> subdirs below that). So it was on NFS.
>
> All 5 input files are in the shared/ dir, but the failing job is the one
> whose timestamp is last. (0, 2,3,4 worked; 1 failed)
>
> I also got 3 emails from PBS of the form:
>
> PBS Job Id: 1571647.tg-master.uc.teragrid.org
> Job Name: STDIN
> Aborted by PBS Server
> Job cannot be executed
> See Administrator for help
>
> all dated 8:05 PM, three consecutive job ids, *47, 48, 49.
>
> Q: Do these email messages indicate that the job was failed by PBS
> before the app was started, or do these messages indicate a non-zero app
> exit, eg, if its input file was missing?
>
> The input files on shared/ were dated:
>
> drwxr-xr-x 4 wilde allocate 512 2007-11-03 20:04:33.000000000
> -0500 _concurrent/
> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000
> -0500 pc1.pcap
> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:55.000000000
> -0500 pc2.pcap
> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000
> -0500 pc3.pcap
> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:47.000000000
> -0500 pc4.pcap
> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:51.000000000
> -0500 pc5.pcap
> -rw-r--r-- 1 wilde allocate 813 2007-11-03 20:04:33.000000000
> -0500 seq.sh
> -rw-r--r-- 1 wilde allocate 4848 2007-11-03 20:04:33.000000000
> -0500 wrapper.sh
>
> The awf3*.log file shows:
>
> 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END
> file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\
> ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/
> provider=file
> 2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END
> jobid=angle4-ujal0lji - Staging in finished
> 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START
> jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\
> 6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0,
> _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0]
> tmpdir=awf3\
> -20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC
>
> (Note that the logfile for some reason logs times 1 hour behind???)
>
> But the main suspicious thing above is that while the log shows stagin
> complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod
> date to be 4:55 past the hour, while the job was started (queued?) at 4:52.
>
> If the job happened to hit the PBS queue right at the time PBS was doing
> a queue poll, it may have started right away, and somehow started before
> file pc1.pcap was visible to the worker node. Im not sure what if
> anything in the synchronization prevents this, especially if NFS
> close-to-open consistency is broken. (Which we are very suspicious of on
> this site and with Linux NFS in general).
>
> Lastly, i've run the identical workflow twice more now, and its worked
> with no change both times.
>
> Any ideas or other explanations for what may have happened here?
>
> Also, ideas why the swift log file shows times an hour behind?
>
>
>
>
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
More information about the Swift-devel
mailing list