[Swift-devel] 1000-job angle workflow gets high failure rate
Michael Wilde
wilde at mcs.anl.gov
Mon Nov 5 22:27:55 CST 2007
Was: Re: [Swift-devel] Re: high load on tg-grid1
Ben, the logs of my first 1000-job run for this week are in
swift-logs/wilde/run153.
This run shows a high volume (396) of the same emailed PBS error,
"Aborted by PBS Server", that I first saw on Saturday night (although it
turns out I now see these sporadically in my email going back to August).
It produced 469 kickstart records and 1064 (out of 2000) data files.
Assuming the data files came in pairs, that would be 532 succeeding
jobs. It's odd that 469+532=1001, but perhaps that's a coincidence.
I'm not going to take this log apart yet; first I want to rerun with
clustering and check my throttles. Is it possible that throttles opened
too wide are causing the PBS failures? And might the same issue explain
the 5-wide angle run?
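For the record, the throttles I'll be checking are the usual
swift.properties entries. Here's a sketch of what I believe the relevant
lines look like (these are the stock defaults as I recall them, not
necessarily what run153 ran with):

    # swift.properties throttles (defaults, from memory)
    throttle.submit=4            # concurrent job submissions overall
    throttle.host.submit=2       # concurrent submissions per host
    throttle.score.job.factor=4  # jobs per site, scaled by site score
    throttle.transfers=4         # concurrent file transfers
    throttle.file.operations=8   # concurrent file operations

If I remember right, setting any of these to "off" removes the limit; on
a 1000-job run that could stack up enough simultaneous PBS submissions
to explain both the aborts and the load spike Joe saw.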
Also: I thought kickstart records would get returned in a directory tree, no?
Lastly, I'd like to get the input files mapped from a tree structure.
Can structured_regexp_mapper do that? I.e., can I set its source to a
directory rather than a Swift variable? (You might have explained that,
but I didn't get it in my notes.) If the arguments to this mapper have
some powerful variations, can you fire off a note describing them?
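To make the question concrete, here's roughly what I'm trying to write
(untested; the type names, the predictor app, and the paths are all made
up, and I'm guessing at the mapper parameters):

    type protein;
    type angle;

    (angle o) predict (protein p) {
        app {
            predictor @p @o;
        }
    }

    # filesys_mapper pulls matching files from a directory (location)
    protein inputs[] <filesys_mapper; location="data", suffix=".pdb">;

    # structured_regexp_mapper derives one output name per source entry;
    # its source appears to be a Swift array, not a directory
    angle outputs[] <structured_regexp_mapper; source=inputs,
                     match="(.*)\\.pdb", transform="\\1.angle">;

    foreach p, i in inputs {
        outputs[i] = predict(p);
    }

If that's right, filesys_mapper does the directory walk and
structured_regexp_mapper only rewrites names. I'm also not sure whether
filesys_mapper descends subdirectories; if not, is an ext mapper script
the way to map a whole tree?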
Thanks,
Mike
On 11/5/07 9:43 PM, Michael Wilde wrote:
> Joe, I started a workflow with 1000 jobs - most likely that's what caused
> this. I need to check the throttles on this workflow - it's possible they
> were open too wide.
>
> Another possibility - not sure if this was cause or effect - was that I
> got hundreds of messages from PBS (job aborted messages) of the form
> that I reported to help at tg yesterday.
>
> I'm about to investigate the logs, but all my jobs are out of the queue
> now, and the workflow has completed.
>
> (Ben: I'll be filing the log momentarily, after I do an initial check of
> it. Of 1000 jobs I got about 533 result datasets returned. This was
> without clustering.) I got 396 emails from PBS.
>
> - Mike
>
> (Ti: responding to tg-support as that's where Joe sent this...)
>
> On 11/5/07 9:15 PM, joseph insley wrote:
>> I'm not sure what was causing this, but the load on tg-grid1 spiked to
>> over 200 a short while ago. It's coming back down now, but while it
>> was high I tried to submit a job through GRAM (pre-WS) and after a
>> long wait I got the error "GRAM Job submission failed because an I/O
>> operation failed (error code 3)"
>>
>> At the time there were a number of globus-job-manager processes
>> belonging to Mike Wilde, but only on the order of ~30-something... it
>> doesn't seem like this should cause such a high load, so I don't know
>> what was up...
>>
>> joe.
>>
>> ===================================================
>> joseph a. insley
>> insley at mcs.anl.gov
>> mathematics & computer science division
>> argonne national laboratory
>> (630) 252-5649 / (630) 252-5986 (fax)