[Swift-devel] Re: Jobs being aborted by PBS server on tg-grid.uc.teragrid.org
Michael Wilde
wilde at mcs.anl.gov
Tue Nov 6 10:28:36 CST 2007
Excellent, thanks Ti. This explains many of our problems, I think.
- Mike
On 11/6/07 10:19 AM, help at teragrid.org wrote:
> FROM: Leggett, Ti
> (Concerning ticket No. 147814)
>
> I think I fixed this this morning. In every case you were given a node that tg-grid1 could not
> communicate with. If you still see this, immediately run:
>
> checkjob <jobid>
>
> if you can and send the output. If you can't, send me the job ID.
>
> Michael Wilde <help at teragrid.org> writes:
>> The errors below are from workflows of only 5 jobs.
>> One job of the five failed in each of these 3 incidents.
>> In each case the failing job was then retried twice more (automatically
>> by Swift).
>>
>> GRAM was not failing to my knowledge during these times.
>>
>> Do the PBS logs indicate anything?
>>
>> - Mike
>>
>>
>> On 11/6/07 9:52 AM, help at teragrid.org wrote:
>>> FROM: Leggett, Ti
>>> (Concerning ticket No. 147814)
>>>
>>> Are you getting these when you're submitting many (thousands) of jobs and does it coincide with the
>>> gatekeeper becoming unavailable?
>>>
>>> Michael Wilde <help at teragrid.org> writes:
>>>> I'm starting to see more frequent problems like this.
>>>> Happened once last night to 3 consecutive jobs, and tonight happened
>>>> twice, to 6 jobs.
>>>>
>>>> Ti, could you look in the PBS logs, possibly on the related node(s), and
>>>> see if it's looking like a problem on tg-uc or on our side?
>>>>
>>>> Thanks,
>>>>
>>>> Mike
>>>>
>>>>
>>>> 11/3 8:05 PM - 3 failures
>>>> Job IDs 1571647, 48, & 49
>>>> 11/4 7:46 PM - 3 failures
>>>> Job IDs 1572031, 33, & 34
>>>> 11/4 8:56 - 8:57 PM
>>>> 1572040, 42, 43
>>>>
>>>> All errors have the format below.
>>>>
>>>> Swift attempts each failing job 3 times in total, hence the groups of 3 above.
>>>>
>>>>
>>>> -------- Original Message --------
>>>> Subject: PBS JOB 1572043.tg-master.uc.teragrid.org
>>>> Date: Sun, 4 Nov 2007 20:57:11 -0600 (CST)
>>>> From: adm at tg-master.uc.teragrid.org (root)
>>>> To: wilde at tg-grid1.uc.teragrid.org
>>>>
>>>> PBS Job Id: 1572043.tg-master.uc.teragrid.org
>>>> Job Name: STDIN
>>>> Aborted by PBS Server
>>>> Job cannot be executed
>>>> See Administrator for help
>>>
>
>
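
For anyone following up on the advice above to run checkjob on affected jobs, here is a minimal Python sketch. It assumes the shorthand job IDs in the report expand by replacing the trailing digits of the first ID (so "1571647, 48, & 49" is read as 1571647, 1571648, 1571649). The helper names expand_job_ids and checkjob_commands are hypothetical, and the script only prints the commands rather than running them.

```python
def expand_job_ids(first, suffixes):
    """Expand shorthand such as ("1571647", ["48", "49"]) into full job IDs.

    Assumption: each suffix replaces the same number of trailing digits
    of the first ID.
    """
    ids = [first]
    for suffix in suffixes:
        ids.append(first[:len(first) - len(suffix)] + suffix)
    return ids


def checkjob_commands(job_ids):
    """Build one `checkjob <jobid>` argument list per job ID."""
    return [["checkjob", job_id] for job_id in job_ids]


# Job ID groups as listed in the original report.
groups = [
    ("1571647", ["48", "49"]),
    ("1572031", ["33", "34"]),
    ("1572040", ["42", "43"]),
]

for first, suffixes in groups:
    for cmd in checkjob_commands(expand_job_ids(first, suffixes)):
        # Print the command line only; run it by hand on the cluster.
        print(" ".join(cmd))
```

Printing instead of executing keeps the sketch safe to run anywhere; on a host where checkjob is installed, the printed lines can be pasted into a shell and the output sent along with the ticket.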