[Swift-devel] Re: Jobs being aborted by PBS server on tg-grid.uc.teragrid.org

Michael Wilde wilde at mcs.anl.gov
Tue Nov 6 10:28:36 CST 2007


Excellent, thanks Ti.  This explains many of our problems, I think.

- Mike


On 11/6/07 10:19 AM, help at teragrid.org wrote:
> FROM: Leggett, Ti
> (Concerning ticket No. 147814)
> 
> I think I fixed this this morning. In all the cases you were given a node that tg-grid1 could not 
> communicate with. If you still see this, immediately run:
> 
> checkjob <jobid>
> 
> if you can and send the output. If you can't, send me the job ID.
> 
> Michael Wilde <help at teragrid.org> writes:
>> The errors below are from workflows of only 5 jobs.
>> One job of the five failed in each of these 3 incidents.
>> The failing job was then in each case retried twice more (automatically 
>> by Swift)
>>
>> GRAM was not failing to my knowledge during these times.
>>
>> Do the PBS logs indicate anything?
>>
>> - Mike
>>
>>
>> On 11/6/07 9:52 AM, help at teragrid.org wrote:
>>> FROM: Leggett, Ti
>>> (Concerning ticket No. 147814)
>>>
>>> Are you getting these when you're submitting many (thousands) of jobs and does it coincide with the
>>> gatekeeper becoming unavailable?
>>>
>>> Michael Wilde <help at teragrid.org> writes:
>>>> I'm starting to see more frequent problems like this.
>>>> Happened once last night to 3 consecutive jobs, and tonight happened 
>>>> twice, to 6 jobs.
>>>>
>>>> Ti, could you look in the PBS logs, possibly on the related node(s) and 
>>>> see if it's looking like a problem on tg-uc or on our side?
>>>>
>>>> Thanks,
>>>>
>>>> Mike
>>>>
>>>>
>>>> 11/3 8:05 PM - 3 failures
>>>>  Job IDs 1571647, 48, & 49
>>>> 11/4 7:46 PM - 3 failures
>>>>  Job IDs 1572031, 33, & 34
>>>> 11/4 8:56 - 8:57 PM - 3 failures
>>>>  Job IDs 1572040, 42, & 43
>>>>
>>>> All errors have the format below.
>>>>
>>>> Swift retries failing jobs 3 times, hence the groups of 3 above.
>>>>
>>>>
>>>> -------- Original Message --------
>>>> Subject: PBS JOB 1572043.tg-master.uc.teragrid.org
>>>> Date: Sun,  4 Nov 2007 20:57:11 -0600 (CST)
>>>> From: adm at tg-master.uc.teragrid.org (root)
>>>> To: wilde at tg-grid1.uc.teragrid.org
>>>>
>>>> PBS Job Id: 1572043.tg-master.uc.teragrid.org
>>>> Job Name:   STDIN
>>>> Aborted by PBS Server
>>>> Job cannot be executed
>>>> See Administrator for help
>>>
> 
> 


