[Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!

Michael Andric mjandric at gmail.com
Thu Jul 24 17:57:42 CDT 2008


it's ucanl (not ncsa) that has been completing a few jobs and then declining, e.g.

Progress:  Initializing:73 Selecting site:6922 Executing:5
Mediator completed
Progress:  Initializing:73 Selecting site:6922 Stage out:4 Finished
successfully:1
Mediator completed
Mediator completed
Mediator completed
Mediator completed
Progress:  Initializing:73 Selecting site:6916 Executing:5 Finished
successfully:5 Failed but can retry:1
Failed to transfer wrapper log from
PermFriedman-20080724-1033-7eg450y8/info/z/ANLUCTERAGRID64
Failed to transfer wrapper log from
PermFriedman-20080724-1033-7eg450y8/info/1/ANLUCTERAGRID64
Failed to transfer wrapper log from
PermFriedman-20080724-1033-7eg450y8/info/3/ANLUCTERAGRID64
Progress:  Initializing:73 Selecting site:6918 Executing:2 Finished
successfully:5 Failed but can retry:2
Failed to transfer wrapper log from
PermFriedman-20080724-1033-7eg450y8/info/2/ANLUCTERAGRID64
Progress:  Initializing:73 Selecting site:6919 Executing:2 Finished
successfully:5 Failed but can retry:1
Failed to transfer wrapper log from
PermFriedman-20080724-1033-7eg450y8/info/9/ANLUCTERAGRID64
Progress:  Initializing:73 Selecting site:6919 Executing:2 Finished
successfully:5 Failed but can retry:1
Failed to transfer wrapper log from
PermFriedman-20080724-1033-7eg450y8/info/b/ANLUCTERAGRID64
Progress:  Initializing:73 Selecting site:6919 Executing:3 Finished
successfully:5
Failed to transfer wrapper log from
PermFriedman-20080724-1033-7eg450y8/info/d/ANLUCTERAGRID64
Progress:  Initializing:73 Selecting site:6919 Executing:2 Finished
successfully:5 Failed but can retry:1
Failed to transfer wrapper log from
PermFriedman-20080724-1033-7eg450y8/info/f/ANLUCTERAGRID64
Progress:  Initializing:73 Selecting site:6919 Executing:2 Finished
successfully:5 Failed but can retry:1
Failed to transfer wrapper log from
PermFriedman-20080724-1033-7eg450y8/info/h/ANLUCTERAGRID64
Progress:  Initializing:73 Selecting site:6919 Executing:2 Finished
successfully:5 Failed but can retry:1
Failed to transfer wrapper log from
PermFriedman-20080724-1033-7eg450y8/info/j/ANLUCTERAGRID64
Progress:  Initializing:73 Selecting site:6919 Executing:3 Finished
successfully:5
Progress:  Initializing:73 Selecting site:6919 Executing:3 Finished
successfully:5
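To make a trace like the one above easier to compare between runs, here is a minimal sketch that turns the "Progress:" lines into per-state counts. It is not part of Swift; `parse_progress` is a made-up helper name, and it assumes the state names contain only letters and spaces and that a wrapped record continues on the very next line, as in this email.

```python
import re

# "State name:count" pairs, e.g. "Selecting site:6916" or "Failed but can retry:1".
PAIR = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")
# A wrapped continuation line starts with a pair, e.g. "successfully:5 ...".
CONTINUATION = re.compile(r"^[A-Za-z][A-Za-z ]*:\d+")


def parse_progress(log_text):
    """Return one {state: count} dict per 'Progress:' record in log_text."""
    lines = [line.strip() for line in log_text.splitlines()]
    records = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("Progress:"):
            body = line[len("Progress:"):].strip()
            # Glue a wrapped continuation line back onto the record.
            if (i + 1 < len(lines)
                    and not lines[i + 1].startswith("Progress:")
                    and CONTINUATION.match(lines[i + 1])):
                body += " " + lines[i + 1]
                i += 1
            records.append({name.strip(): int(count)
                            for name, count in PAIR.findall(body)})
        i += 1
    return records


sample = """Progress:  Initializing:73 Selecting site:6916 Executing:5 Finished
successfully:5 Failed but can retry:1
Failed to transfer wrapper log from
PermFriedman-20080724-1033-7eg450y8/info/3/ANLUCTERAGRID64
Progress:  Initializing:73 Selecting site:6919 Executing:3 Finished
successfully:5
"""

for record in parse_progress(sample):
    print(record)
```

Lines that are not progress records ("Mediator completed", the "Failed to transfer wrapper log" messages) are skipped, so the output only shows how the counts move over time.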



on ncsa, it seems recently to either work completely or not at all.  yesterday i
got 73 jobs 'Finished successfully' on there and then it just hung, so i
killed it (after letting it hang for a few hours).  today, i couldn't get it
to even start executing (since the site is down).

and this 'new site', it's been sitting at:

Progress:  Selecting site:6994 Executing:6
Progress:  Selecting site:6994 Executing:6
Progress:  Selecting site:6994 Executing:6

since 2pm this afternoon, still with nothing finished, no errors, no
indication of what's going on...
woo grid computing!
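A rough sketch of how a hang like this could be flagged automatically instead of by watching the terminal: treat the run as stalled when the last few progress lines are identical. `is_stalled` and the `window` parameter are hypothetical names for illustration, not anything Swift provides, and identical lines only suggest a hang (long-running jobs look the same).

```python
def is_stalled(progress_lines, window=3):
    """Return True if the last `window` progress lines are identical."""
    if window < 2:
        raise ValueError("window must be at least 2")
    # Collapse runs of whitespace so spacing differences do not matter.
    recent = [" ".join(line.split()) for line in progress_lines[-window:]]
    return len(recent) == window and len(set(recent)) == 1


hung = ["Progress:  Selecting site:6994 Executing:6"] * 3
moving = [
    "Progress:  Initializing:73 Selecting site:6922 Executing:5",
    "Progress:  Initializing:73 Selecting site:6918 Executing:2",
    "Progress:  Initializing:73 Selecting site:6919 Executing:2",
]

print(is_stalled(hung))
print(is_stalled(moving))
```

A larger `window` makes the check less jumpy when progress lines are printed more often than jobs change state.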


On Thu, Jul 24, 2008 at 5:49 PM, <skenny at uchicago.edu> wrote:

> >On Thu, 2008-07-24 at 17:32 -0500, skenny at uchicago.edu wrote:
> >> yes (see below) and SOME of the jobs in the workflow do
> >> complete when we submit the whole workflow to ucanl.
> >
> >Indeed. It seems like roughly half of them work and the other half
> >break. Could this be an ia32/ia64 issue? Like python being compiled for
> >the wrong platform or something?
>
> hmm, not quite sure i follow, since we're only sending to ia64
> on this run...how can i test?
>
> >> unfortunately i can't test anything on ncsa right now 'cause
> >> it's down.
> >
> >It being down would generally prevent swift from being able to run jobs
> >there. Which is probably what happened the week before.
>
> ha ha, what swift can't run jobs on a site that's down?
> lame! heh, actually we've had a couple of runs now where we
> see the behavior i described on ncsa--e.g. a few jobs
> completing but some failing and an eventual decline. though,
> it's true the site's been up and down quite a bit over the
> past few weeks so could be indicative of something else wrong
> entirely. incidentally, i told them a couple weeks
> ago i was having trouble submitting to gram4 so we switched
> back to gram2 and it *seemed* to be working...for a while.
>
> well, we're trying on yet another site now so if we see more
> of the same we'll know we need to do *something* on our end.
>
> thanks
> sarah
>
>
> >
> >
>
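On the ia32/ia64 question in the quoted thread ("how can i test?"), one possible check is to run a tiny script through the same path as the real jobs and compare what the python on the worker node reports with what the node itself is. This is only a sketch using the standard `platform`, `struct` and `sys` modules; `describe_interpreter` is a made-up name, and how the script gets submitted to the site is left out.

```python
import platform
import struct
import sys


def describe_interpreter():
    """Collect basic facts about the host and the running python build."""
    return {
        # Hardware name reported by the OS, e.g. "ia64", "x86_64", "i686".
        "machine": platform.machine(),
        # Bitness of the python executable itself, e.g. "32bit" or "64bit".
        "python_bits": platform.architecture()[0],
        # Pointer size of this build, in bits (cross-check on the line above).
        "pointer_bits": struct.calcsize("P") * 8,
        # Which python binary actually ran the script.
        "executable": sys.executable,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

If the failing jobs report a different `executable` or bitness than the ones that succeed, that would point toward the platform mismatch theory; if they all match, the problem is more likely elsewhere.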