[Swift-devel] Error in syncing job start with input file availability?

Michael Wilde wilde at mcs.anl.gov
Sun Nov 4 20:38:22 CST 2007


A very similar error occurred a bit ago, tonight.
It's in ~benc/swiftlogs/wilde/run143,
along with the same 3 PBS emailed errors in pbs.errors.out.

This was using r1453.  Upgrading to r1456 now.

Just FYI - don't bother with this till we see it with the latest release.

Also, this run was with kickstart; yesterday's was not.

- Mike


On 11/4/07 7:45 AM, Michael Wilde wrote:
> On 11/4/07 12:18 AM, Ben Clifford wrote:
> 
>  >> But the main suspicious thing above is that while the log shows stage-in
>  >> complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod
>  >> date to be 4:55 past the hour, while the job was started (queued?) at 4:52.
>  >
>  > mod date is 4:52.
> 
> I got my file and job mixed up. The file was pc2.pcap and the mod date was 
> 4:55, but so was the job start time, so that looks OK. My mistake.
> 
>  >
>  > What site? Can you use a different FS?
>  >
> 
> uc-teragrid. I will experiment with both NFS and GPFS.
> Did you determine with Andrew which is faster? More reliable?
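> 
> (Side note: if we want to test the consistency question directly, a 
> minimal two-host probe could look like the sketch below. The hostnames 
> hostA and hostB are hypothetical; it assumes the same NFS mount is 
> visible from both, e.g. under ~/swiftdata.)
> 
>   #!/bin/bash
>   # Write and close a file on one host, then immediately read it from
>   # another host sharing the NFS mount. If the reader sees a missing
>   # or empty file, close-to-open consistency is suspect.
>   F=$HOME/swiftdata/cto-probe.$$
>   ssh hostA "date > $F"                          # write + close on host A
>   ssh hostB "cat $F" || echo "not yet visible on host B"
>   ssh hostB "rm -f $F"                           # clean up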
> 
> On 11/3/07 11:59 PM, Mihael Hategan wrote:
>> Looks to me like a problem with PBS rather than something with the jobs.
>> So I don't think this is worth investigating. It belongs to the "random
>> bad things happen on occasion" class of problems, for which we have
>> restarts and scoring.
> 
> Possibly. In this case the job was re-run twice (3 attempts total, within 
> a minute), all three failed, and all three produced the same PBS error 
> message emailed to me. I agree, not worth investigating unless it happens more.
> 
> grep JOB_START a*.log | grep pc2
> 
> 2007-11-03 19:04:55,814-0600 DEBUG vdl:execute2 JOB_START 
> jobid=angle4-tjal0lji tr=angle4 arguments=[pc2.pcap, 
> _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, 
> _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] 
> tmpdir=awf3-20071103-1904-2z266pk3/jobs/t/angle4-tjal0lji host=UC
> 2007-11-03 19:05:29,495-0600 DEBUG vdl:execute2 JOB_START 
> jobid=angle4-vjal0lji tr=angle4 arguments=[pc2.pcap, 
> _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, 
> _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] 
> tmpdir=awf3-20071103-1904-2z266pk3/jobs/v/angle4-vjal0lji host=UC
> 2007-11-03 19:05:42,678-0600 DEBUG vdl:execute2 JOB_START 
> jobid=angle4-xjal0lji tr=angle4 arguments=[pc2.pcap, 
> _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, 
> _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] 
> tmpdir=awf3-20071103-1904-2z266pk3/jobs/x/angle4-xjal0lji host=UC
> 
> grep EXCEPT a*.log
> 
> 2007-11-03 19:05:28,567-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION 
> jobid=angle4-tjal0lji - Application exception: No status file was found. 
> Check the shared filesystem on UC
> 2007-11-03 19:05:41,754-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION 
> jobid=angle4-vjal0lji - Application exception: No status file was found. 
> Check the shared filesystem on UC
> 2007-11-03 19:05:55,048-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION 
> jobid=angle4-xjal0lji - Application exception: No status file was found. 
> Check the shared filesystem on UC
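> 
> (For context, my understanding of where that message comes from - the 
> exact paths here are from memory, so treat this as a sketch rather than 
> the actual Swift source: wrapper.sh runs the app on the worker node and 
> writes a small status file into the job directory on the shared 
> filesystem, and after PBS reports the job done the submit side looks for 
> that file to learn the exit code. If PBS aborted the job before 
> wrapper.sh ever ran, or if NFS hasn't yet made the file visible, the 
> check fails with exactly this message.)
> 
>   # Illustrative submit-side check (not the actual Swift code; the
>   # status-file path layout is an assumption):
>   status_file=$jobdir/status/${jobid}-status
>   if [ -f "$status_file" ]; then
>       rc=$(cat "$status_file")        # application exit code
>   else
>       echo "No status file was found. Check the shared filesystem on $site"
>   fi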
> 
> - Mike
> 
>>
>> Mihael
>>
>> On Sat, 2007-11-03 at 23:34 -0500, Michael Wilde wrote:
>>> In the angle run in ~benc/swift-logs/wilde/run121, it looks to me 
>>> like there's a chance that a job attempted to start before its data 
>>> was visible to the node (not sure, just suspicious).
>>>
>>> It's a 5-job angle run.  4 jobs worked. The 5th job failed, the one 
>>> for index 1 of a 5-element input file array.
>>>
>>> The workflow ran with ~wilde/swiftdata/* as the storage and work dir 
>>> (2 subdirs below that), so it was on NFS.
>>>
>>> All 5 input files are in the shared/ dir, but the failing job is the 
>>> one whose timestamp is last. (0, 2, 3, 4 worked; 1 failed.)
>>>
>>> I also got 3 emails from PBS of the form:
>>>
>>> PBS Job Id: 1571647.tg-master.uc.teragrid.org
>>> Job Name:   STDIN
>>> Aborted by PBS Server
>>> Job cannot be executed
>>> See Administrator for help
>>>
>>> all dated 8:05 PM, with three consecutive job IDs: *47, *48, *49.
>>>
>>> Q: Do these email messages indicate that the job was failed by PBS 
>>> before the app was started, or do they indicate a non-zero app exit, 
>>> e.g., if its input file was missing?
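>>>
>>> (One way to tell would be to ask PBS itself. Assuming tracejob is 
>>> installed on tg-master, something like this should show whether the 
>>> job ever reached the execution phase - an abort logged before any 
>>> "Job Run" record would mean the app never started:)
>>>
>>>   tracejob 1571647.tg-master.uc.teragrid.org   # server/MOM log trail
>>>   qstat -f 1571647.tg-master.uc.teragrid.org   # if still known to the server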
>>>
>>> The input files on shared/ were dated:
>>>
>>> drwxr-xr-x  4 wilde allocate      512 2007-11-03 20:04:33.000000000 -0500 _concurrent/
>>> -rw-r--r--  1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 -0500 pc1.pcap
>>> -rw-r--r--  1 wilde allocate 46747037 2007-11-03 20:04:55.000000000 -0500 pc2.pcap
>>> -rw-r--r--  1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 -0500 pc3.pcap
>>> -rw-r--r--  1 wilde allocate 46747037 2007-11-03 20:04:47.000000000 -0500 pc4.pcap
>>> -rw-r--r--  1 wilde allocate 46747037 2007-11-03 20:04:51.000000000 -0500 pc5.pcap
>>> -rw-r--r--  1 wilde allocate      813 2007-11-03 20:04:33.000000000 -0500 seq.sh
>>> -rw-r--r--  1 wilde allocate     4848 2007-11-03 20:04:33.000000000 -0500 wrapper.sh
>>>
>>> The awf3*.log file shows:
>>>
>>> 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END 
>>> file=file://localhost/pc1.pcap srchost=localhost srcdir= 
>>> srcname=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ 
>>> provider=file
>>> 2007-11-03 19:04:52,400-0600 INFO  vdl:dostagein END 
>>> jobid=angle4-ujal0lji - Staging in finished
>>> 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START 
>>> jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, 
>>> _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, 
>>> _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] 
>>> tmpdir=awf3-20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC
>>>
>>> (Note that the log file for some reason shows times an hour behind - 
>>> its stamps carry -0600 while ls reports -0500.)
>>>
>>> But the main suspicious thing above is that while the log shows 
>>> stage-in complete for pc1.pcap at 4:52 past the hour, the ls shows 
>>> the file mod date to be 4:55 past the hour, while the job was started 
>>> (queued?) at 4:52.
>>>
>>> If the job happened to hit the PBS queue right at the time PBS was 
>>> doing a queue poll, it may have started right away, and somehow 
>>> started before file pc1.pcap was visible to the worker node.  I'm not 
>>> sure what, if anything, in the synchronization prevents this, 
>>> especially if NFS close-to-open consistency is broken (which we are 
>>> very suspicious of on this site, and with Linux NFS in general).
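>>>
>>> (If that is the failure mode, one defensive workaround would be for 
>>> the worker-side wrapper to wait briefly for its inputs before 
>>> launching the app. A minimal sketch - this is not what wrapper.sh 
>>> currently does:)
>>>
>>>   # Poll until an input file is readable on the worker node, up to a
>>>   # timeout, before starting the application.
>>>   wait_for_file() {
>>>       local f=$1 tries=${2:-30}
>>>       while [ "$tries" -gt 0 ]; do
>>>           [ -r "$f" ] && return 0
>>>           sleep 1
>>>           tries=$((tries - 1))
>>>       done
>>>       echo "timed out waiting for $f" >&2
>>>       return 1
>>>   }
>>>   wait_for_file shared/pc1.pcap || exit 1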
>>>
>>> Lastly, I've run the identical workflow twice more now, and it 
>>> worked with no change both times.
>>>
>>> Any ideas or other explanations for what may have happened here?
>>>
>>> Also, any ideas why the Swift log file shows times an hour behind?
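>>>
>>> (The log stamps carry -0600 while ls reports -0500, so my guess is 
>>> that the JVM's default timezone differs from the shell's - e.g. a 
>>> fixed CST zone with no DST. A quick comparison, assuming standard 
>>> tools:)
>>>
>>>   date +%z                        # the shell's current UTC offset
>>>   TZ='America/Chicago' date +%z   # Central time with DST rules
>>>   TZ='CST6' date +%z              # fixed CST, no DST: always -0600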
>>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel