[Swift-devel] Error in syncing job start with input file availability?

Michael Wilde wilde at mcs.anl.gov
Sun Nov 4 07:45:41 CST 2007


On 11/4/07 12:18 AM, Ben Clifford wrote:

 >> But the main suspicious thing above is that while the log shows staging
 >> complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod
 >> date to be 4:55 past the hour, while the job was started (queued?) at 4:52.
 >
 > mod date is 4:52.

I got my file and job mixed up. The file was pc2.pcap and the mod date was 
4:55, but so was the job start time, so that looks OK. My mistake.

 >
 > What site? Can you use a different FS?
 >

uc-teragrid. I will experiment with both NFS and GPFS.
Did you and Andrew determine which is faster? More reliable?
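
To check which filesystem a given run directory actually lands on before 
trying each variant, something like this should work (a rough sketch using 
GNU coreutils; the path is only an illustration of my swiftdata layout, not 
the exact run directory):

  # Path is illustrative -- substitute the actual Swift work directory.
  WORKDIR=$HOME/swiftdata
  df -TP "$WORKDIR"            # mount point and fs type (nfs, gpfs, ...)
  stat -f -c '%T' "$WORKDIR"   # just the filesystem type name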

On 11/3/07 11:59 PM, Mihael Hategan wrote:
> Looks to me like a problem with PBS rather than something with the jobs.
> So I don't think this is worth investigating. It belongs to the "random
> bad things happen on occasion" class of problems, for which we have
> restarts and scoring.

Possibly. In this case the job was re-run twice (three attempts total, 
within a minute) and all three failed; each one got the same PBS error 
message emailed to me. I agree, it's not worth investigating unless it 
happens more often.

grep JOB_START a*.log | grep pc2

2007-11-03 19:04:55,814-0600 DEBUG vdl:execute2 JOB_START 
jobid=angle4-tjal0lji tr=angle4 arguments=[pc2.pcap, 
_concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, 
_concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] 
tmpdir=awf3-20071103-1904-2z266pk3/jobs/t/angle4-tjal0lji host=UC
2007-11-03 19:05:29,495-0600 DEBUG vdl:execute2 JOB_START 
jobid=angle4-vjal0lji tr=angle4 arguments=[pc2.pcap, 
_concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, 
_concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] 
tmpdir=awf3-20071103-1904-2z266pk3/jobs/v/angle4-vjal0lji host=UC
2007-11-03 19:05:42,678-0600 DEBUG vdl:execute2 JOB_START 
jobid=angle4-xjal0lji tr=angle4 arguments=[pc2.pcap, 
_concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, 
_concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] 
tmpdir=awf3-20071103-1904-2z266pk3/jobs/x/angle4-xjal0lji host=UC

grep EXCEPT a*.log

2007-11-03 19:05:28,567-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION 
jobid=angle4-tjal0lji - Application exception: No status file was found. 
Check the shared filesystem on UC
2007-11-03 19:05:41,754-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION 
jobid=angle4-vjal0lji - Application exception: No status file was found. 
Check the shared filesystem on UC
2007-11-03 19:05:55,048-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION 
jobid=angle4-xjal0lji - Application exception: No status file was found. 
Check the shared filesystem on UC
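
For the record, here is a quick way to line up each failed job's start and 
exception lines (a rough sketch against the log excerpts above; it assumes 
the wrapped lines shown are single lines in the actual a*.log files):

  # For every job id that hit an APPLICATION_EXCEPTION, print its
  # JOB_START line followed by its exception line.
  for id in $(grep -h APPLICATION_EXCEPTION a*.log | sed -n 's/.*jobid=\([^ ]*\).*/\1/p'); do
      grep -h "jobid=$id" a*.log | grep -E 'JOB_START|APPLICATION_EXCEPTION'
      echo
  done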

- Mike

> 
> Mihael
> 
> On Sat, 2007-11-03 at 23:34 -0500, Michael Wilde wrote:
>> In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like 
>> there's a chance that a job attempted to start before its data was 
>> visible to the node (not sure, just suspicious).
>>
>> It's a 5-job angle run.  4 jobs worked. The 5th job failed, the one for 
>> index 1 of a 5-element input file array.
>>
>> The workflow ran with ~wilde/swiftdata/* as the storage and work dir (2 
>> subdirs below that). So it was on NFS.
>>
>> All 5 input files are in the shared/ dir, but the failing job is the one 
>> whose timestamp is last. (0, 2,3,4 worked; 1 failed)
>>
>> I also got 3 emails from PBS of the form:
>>
>> PBS Job Id: 1571647.tg-master.uc.teragrid.org
>> Job Name:   STDIN
>> Aborted by PBS Server
>> Job cannot be executed
>> See Administrator for help
>>
>> all dated 8:05 PM, three consecutive job ids, *47, 48, 49.
>>
>> Q: Do these email messages indicate that the job was failed by PBS 
>> before the app was started, or do they indicate a non-zero app exit, 
>> e.g., if its input file was missing?
>>
>> The input files on shared/ were dated:
>>
>> drwxr-xr-x    4 wilde    allocate      512 2007-11-03 20:04:33.000000000 
>> -0500 _concurrent/
>> -rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:52.000000000 
>> -0500 pc1.pcap
>> -rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:55.000000000 
>> -0500 pc2.pcap
>> -rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:52.000000000 
>> -0500 pc3.pcap
>> -rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:47.000000000 
>> -0500 pc4.pcap
>> -rw-r--r--    1 wilde    allocate 46747037 2007-11-03 20:04:51.000000000 
>> -0500 pc5.pcap
>> -rw-r--r--    1 wilde    allocate      813 2007-11-03 20:04:33.000000000 
>> -0500 seq.sh
>> -rw-r--r--    1 wilde    allocate     4848 2007-11-03 20:04:33.000000000 
>> -0500 wrapper.sh
>>
>> The awf3*.log file shows:
>>
>> 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END 
>> file=file://localhost/pc1.pcap srchost=localhost srcdir= 
>> srcname=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ 
>> provider=file
>> 2007-11-03 19:04:52,400-0600 INFO  vdl:dostagein END 
>> jobid=angle4-ujal0lji - Staging in finished
>> 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START 
>> jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, 
>> _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, 
>> _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] 
>> tmpdir=awf3-20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC
>>
>> (Note that the logfile for some reason logs times 1 hour behind???)
>>
>> But the main suspicious thing above is that while the log shows staging 
>> complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod 
>> date to be 4:55 past the hour, while the job was started (queued?) at 4:52.
>>
>> If the job happened to hit the PBS queue right at the time PBS was doing 
>> a queue poll, it may have started right away, and somehow started before 
>> file pc1.pcap was visible to the worker node.  I'm not sure what, if 
>> anything, in the synchronization prevents this, especially if NFS 
>> close-to-open consistency is broken (which we are very suspicious of on 
>> this site and with Linux NFS in general).
>>
>> Lastly, I've run the identical workflow twice more now, and it's worked 
>> with no change both times.
>>
>> Any ideas or other explanations for what may have happened here?
>>
>> Also, any ideas why the Swift log file shows times an hour behind?
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
> 
> 


