[Swift-devel] mystery runs on ucanl

Mihael Hategan hategan at mcs.anl.gov
Tue Jul 29 14:55:27 CDT 2008


On Tue, 2008-07-29 at 14:34 -0500, skenny at uchicago.edu wrote:
> >> >> yes (see below) and SOME of the jobs in the workflow do
> >> >> complete when we submit the whole workflow to ucanl.
> >> >
> >> >Indeed. It seems like roughly half of them work and the other half
> >> >break. Could this be an ia32/ia64 issue? Like python being compiled for
> >> >the wrong platform or something?
> 
> well, i thought that sounded pretty likely (apparently some
> jobs were going to 32-bit machines even though 64 was
> specified in the sites file). however, i've just sent a batch
> to the site and am getting failures on 64-bit nodes as
> well (and on varying nodes, so not just 1 or 2 bum
> nodes)...

The same kinds of failures?

> because there is still this odd behavior of jobs
> remaining in the queue even after they've been killed, i'm
> tempted to blame pbs (gotta blame someone ;) also, i'm getting
> emails from pbs like this:
> 
> PBS Job Id: 1759910.tg-master.uc.teragrid.org
> Job Name:   STDIN
> Exec host:  tg-c054/0
> Aborted by PBS Server 
> Job cannot be executed
> See Administrator for help
> 
> and the swift log simply gives "Failed Error code: 271,
> ProcessDied"

Not the same kind of failures. So we may be dealing with multiple issues
here.

> 
> hence, i'm copying help at teragrid on this...if there are any
> other tests i can run to try and narrow down the bug let me
> know. i've tried submitting several globusrun-ws jobs but
> haven't gotten an error that way as of yet. 



