[Swift-devel] Error in syncing job start with input file availability?
Michael Wilde
wilde at mcs.anl.gov
Sun Nov 4 07:45:41 CST 2007
On 11/4/07 12:18 AM, Ben Clifford wrote:
>> But the main suspicious thing above is that while the log shows staging
>> complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod
>> date to be 4:55 past the hour, while the job was started (queued?) at 4:52.
>
> mod date is 4:52.
I got my file and job mixed up. The file was pc2.pcap, the mod date was
4:55, but so was the job start time, so that looks OK. My mistake.
>
> What site? Can you use a different FS?
>
uc-teragrid. I will experiment with both NFS and GPFS.
Did you determine with Andrew which is faster? More reliable?
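(One small thing I'll do first, just to be sure I'm comparing the right
mounts: check which filesystem each candidate work directory actually
lives on. Something along these lines; the paths are whatever we end up
pointing the work directory at:

vz$ df -hT ~/swiftdata
vz$ mount | grep -iE 'nfs|gpfs'

df -T prints the filesystem type column, so the NFS vs. GPFS distinction
should be unambiguous before timing anything.)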
On 11/3/07 11:59 PM, Mihael Hategan wrote:
> Looks to me like a problem with PBS rather than something with the jobs.
> So I don't think this is worth investigating. It belongs to the "random
> bad things happen on occasion" class of problems, for which we have
> restarts and scoring.
Possibly. In this case the job was re-run twice (3 attempts total, within
a minute), all three failed, and all three got the same PBS error message
emailed to me. I agree, it's not worth investigating unless it happens more.
grep JOB_START a*.log | grep pc2
2007-11-03 19:04:55,814-0600 DEBUG vdl:execute2 JOB_START
jobid=angle4-tjal0lji tr=angle4 arguments=[pc2.pcap,
_concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1,
_concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1]
tmpdir=awf3-20071103-1904-2z266pk3/jobs/t/angle4-tjal0lji host=UC
2007-11-03 19:05:29,495-0600 DEBUG vdl:execute2 JOB_START
jobid=angle4-vjal0lji tr=angle4 arguments=[pc2.pcap,
_concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1,
_concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1]
tmpdir=awf3-20071103-1904-2z266pk3/jobs/v/angle4-vjal0lji host=UC
2007-11-03 19:05:42,678-0600 DEBUG vdl:execute2 JOB_START
jobid=angle4-xjal0lji tr=angle4 arguments=[pc2.pcap,
_concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1,
_concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1]
tmpdir=awf3-20071103-1904-2z266pk3/jobs/x/angle4-xjal0lji host=UC
vz$ grep EXCEPT a*.log
2007-11-03 19:05:28,567-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
jobid=angle4-tjal0lji - Application exception: No status file was found.
Check the shared filesystem on UC
2007-11-03 19:05:41,754-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
jobid=angle4-vjal0lji - Application exception: No status file was found.
Check the shared filesystem on UC
2007-11-03 19:05:55,048-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
jobid=angle4-xjal0lji - Application exception: No status file was found.
Check the shared filesystem on UC
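If it does happen again, one crude way I could test the NFS-visibility
theory would be a throwaway probe job: submit something like the script
below through PBS, and only once it is running, create the marker file on
the submit host; the probe then reports how long the file takes to become
readable on the worker node. (Just a sketch; the path and the 120-second
limit are arbitrary, and this is nothing that Swift itself does.)

#!/bin/bash
# visprobe.sh - hypothetical probe, not part of the workflow.
# Reports how long a freshly created file on the shared filesystem
# takes to become readable on the execution host.
MARKER=$HOME/swiftdata/visprobe-marker
for i in $(seq 1 120); do
    if [ -r "$MARKER" ]; then
        echo "$MARKER readable within ${i}s on $(hostname)"
        exit 0
    fi
    sleep 1
done
echo "$MARKER still not readable after 120s on $(hostname)"
exit 1

# On the submit host: remove any old marker, qsub this script, and once
# the job is running:   date > $HOME/swiftdata/visprobe-marker

Any multi-second delay there would point at close-to-open consistency on
the NFS mount rather than at PBS itself.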
- Mike
>
> Mihael
>
> On Sat, 2007-11-03 at 23:34 -0500, Michael Wilde wrote:
>> In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like
>> there's a chance that a job attempted to start before its data was
>> visible to the node (not sure, just suspicious).
>>
>> It's a 5-job angle run. 4 jobs worked. The 5th job failed, the one for
>> index 1 of a 5-element input file array.
>>
>> The workflow ran with ~wilde/swiftdata/* as the storage and work dir (2
>> subdirs below that), so it was on NFS.
>>
>> All 5 input files are in the shared/ dir, but the failing job is the one
>> whose timestamp is last. (0, 2, 3, 4 worked; 1 failed.)
>>
>> I also got 3 emails from PBS of the form:
>>
>> PBS Job Id: 1571647.tg-master.uc.teragrid.org
>> Job Name: STDIN
>> Aborted by PBS Server
>> Job cannot be executed
>> See Administrator for help
>>
>> all dated 8:05 PM, with three consecutive job ids ending in 47, 48, and 49.
>>
>> Q: Do these email messages indicate that the job was failed by PBS
>> before the app was started, or do they indicate a non-zero app
>> exit, e.g., if its input file was missing?
>>
>> The input files on shared/ were dated:
>>
>> drwxr-xr-x 4 wilde allocate 512 2007-11-03 20:04:33.000000000
>> -0500 _concurrent/
>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000
>> -0500 pc1.pcap
>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:55.000000000
>> -0500 pc2.pcap
>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000
>> -0500 pc3.pcap
>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:47.000000000
>> -0500 pc4.pcap
>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:51.000000000
>> -0500 pc5.pcap
>> -rw-r--r-- 1 wilde allocate 813 2007-11-03 20:04:33.000000000
>> -0500 seq.sh
>> -rw-r--r-- 1 wilde allocate 4848 2007-11-03 20:04:33.000000000
>> -0500 wrapper.sh
>>
>> The awf3*.log file shows:
>>
>> 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END
>> file=file://localhost/pc1.pcap srchost=localhost srcdir=
>> srcname=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/
>> provider=file
>> 2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END
>> jobid=angle4-ujal0lji - Staging in finished
>> 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START
>> jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap,
>> _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0,
>> _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0]
>> tmpdir=awf3-20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC
>>
>> (Note that for some reason the log file shows times an hour behind???)
>>
>> But the main suspicious thing above is that while the log shows staging
>> complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod
>> date to be 4:55 past the hour, while the job was started (queued?) at 4:52.
>>
>> If the job happened to hit the PBS queue right at the time PBS was doing
>> a queue poll, it may have started right away, and somehow started before
>> file pc1.pcap was visible to the worker node. I'm not sure what, if
>> anything, in the synchronization prevents this, especially if NFS
>> close-to-open consistency is broken (which we are very suspicious of on
>> this site, and with Linux NFS in general).
>>
>> Lastly, I've run the identical workflow twice more now, and it's worked
>> with no change both times.
>>
>> Any ideas or other explanations for what may have happened here?
>>
>> Also, any ideas why the Swift log file shows times an hour behind?