[Swift-devel] 1000-job angle workflow gets high failure rate
Michael Wilde
wilde at mcs.anl.gov
Mon Nov 5 22:27:55 CST 2007
Was: Re: [Swift-devel] Re: high load on tg-grid1
Ben, the logs of my first 1000-job run for this week are in
swift-logs/wilde/run153.
This run shows a high volume (396) of the same emailed PBS error,
"Aborted by PBS Server", that I first saw on Saturday night (although it
turns out I now see these sporadically in my email going back to August).
It produced 469 kickstart records and 1064 (out of 2000) data files.
Assuming the data files came in pairs, that would be 532 succeeding
jobs. It's odd that 469+532=1001, but perhaps that's a coincidence.
I'm not going to take this log apart yet; first I want to rerun with
clustering and check my throttles. Is it possible that throttles opened
too wide are causing the PBS failures? And might the same issue explain
the 5-wide angle run?
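For the record, the throttles I'll be checking are the usual
swift.properties entries. Here's a sketch of what I believe the relevant
lines look like (these are the stock defaults as I recall them, not
necessarily what run153 ran with):

    # swift.properties throttles (defaults, from memory)
    throttle.submit=4            # concurrent job submissions overall
    throttle.host.submit=2       # concurrent submissions per host
    throttle.score.job.factor=4  # jobs per site, scaled by site score
    throttle.transfers=4         # concurrent file transfers
    throttle.file.operations=8   # concurrent file operations

If I remember right, setting any of these to "off" removes the limit; on
a 1000-job run that could stack up enough simultaneous PBS submissions
to explain both the aborts and the load spike Joe saw.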
Also: I thought kickstart records would get returned in a directory tree, no?
Lastly, I'd like to get the input files mapped from a tree structure.
Can structured_regexp_mapper do that? I.e., can I set its source to a
directory rather than a Swift variable? (You might have explained that,
but I didn't get it in my notes.) If the arguments to this mapper have
some powerful variations, can you fire off a note describing them?
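To make the question concrete, here's roughly what I'm trying to write
(untested; the type names, the predictor app, and the paths are all made
up, and I'm guessing at the mapper parameters):

    type protein;
    type angle;

    (angle o) predict (protein p) {
        app {
            predictor @p @o;
        }
    }

    # filesys_mapper pulls matching files from a directory (location)
    protein inputs[] <filesys_mapper; location="data", suffix=".pdb">;

    # structured_regexp_mapper derives one output name per source entry;
    # its source appears to be a Swift array, not a directory
    angle outputs[] <structured_regexp_mapper; source=inputs,
                     match="(.*)\\.pdb", transform="\\1.angle">;

    foreach p, i in inputs {
        outputs[i] = predict(p);
    }

If that's right, filesys_mapper does the directory walk and
structured_regexp_mapper only rewrites names. I'm also not sure whether
filesys_mapper descends subdirectories; if not, is an ext mapper script
the way to map a whole tree?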
Thanks,
Mike
On 11/5/07 9:43 PM, Michael Wilde wrote:
> Joe, I started a workflow with 1000 jobs - most likely that's what caused
> this. I need to check the throttles on this workflow - it's possible they
> were open too wide.
>
> Another possibility - not sure if this was cause or effect - was that I
> got hundreds of messages from PBS (job aborted messages) of the form
> that I reported to help at tg yesterday.
>
> I'm about to investigate the logs, but all my jobs are out of the queue
> now, and the workflow has completed.
>
> (Ben: I'll be filing the log momentarily, after I do an initial check of
> it. Of 1000 jobs I got about 533 result datasets returned. This was
> without clustering.) I got 396 emails from PBS.
>
> - Mike
>
> (Ti: responding to tg-support as that's where Joe sent this...)
>
> On 11/5/07 9:15 PM, joseph insley wrote:
>> I'm not sure what was causing this, but the load on tg-grid1 spiked to
>> over 200 a short while ago. It's coming back down now, but while it
>> was high I tried to submit a job through GRAM (pre-WS) and after a
>> long wait I got the error "GRAM Job submission failed because an I/O
>> operation failed (error code 3)"
>>
>> At the time there were a number of globus-job-manager processes
>> belonging to Mike Wilde, but only on the order of ~30-something... it
>> doesn't seem like this should cause such a high load, so I don't know
>> what was up...
>>
>> joe.
>>
>> ===================================================
>> joseph a. insley
>> insley at mcs.anl.gov
>> mathematics & computer science division
>> argonne national laboratory
>> (630) 252-5649 / (630) 252-5986 (fax)