[Swift-user] Looking for the cause of failure
Mihael Hategan
hategan at mcs.anl.gov
Sat Jan 30 22:25:02 CST 2010
On Sat, 2010-01-30 at 23:10 -0500, Andriy Fedorov wrote:
> On Sat, Jan 30, 2010 at 22:46, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > In ~/.globus/coasters you will find a bunch of worker logs. If you can
> > identify the ones for your run (based perhaps on the timestamp on the
> > files), they may contain the reason for the failure.
> >
>
> Strangely, I don't have worker logs for these executions -- the latest
> are from Jan 18.

That indicates that the workers aren't even started. It's somewhat
unfortunate that GRAM fails to stage out stdout/stderr, because those
would likely contain information about the failure.

What you can probably do in this case is try to reproduce the jobs that
the coasters submit and do it manually with qsub or GRAM to see what the
queuing system complains about.

For that, you could enable log4j debugging for
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.
That would give you the gt2 RSL of the job, and that would likely be
useful.
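
For reference, a minimal sketch of what that could look like, assuming the
log4j 1.x properties format and that your Swift installation reads its
logging configuration from a log4j.properties file (the file location is
an assumption; check your install):

```properties
# Log at DEBUG level for the GT2 job submission handler only,
# so that the RSL of each submitted job ends up in the log.
log4j.logger.org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler=DEBUG
```

After rerunning the workflow, the RSL should show up in the run log, and
it can then be resubmitted by hand to see what GRAM or the queuing system
reports.
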
>
> >> Anybody can explain what happened? The same workflow ran earlier, but
> >> with fewer (2) workers per node.
> >
> > Does it work if you set workers per node to 2 again? If yes, that may be
> > an indication that the workers per node setting causes a problem, and
> > that's a stronger statement than "it doesn't work right now".
> >
>
> I will try, and let you know. If this is indeed the case, is there any
> particular reason why it may not work for 4 workers per node?
>
> As Mike pointed out, the nodes actually have 8 cores.

No idea. I'm pretty much blind about the issue, and in such cases it
seems that the reasonable solution is to use a stick and hit random
things and get a feel for the obstacles around.

Now, Mike's suggestion about using the PBS provider directly seems like
a good one because it provides an alternative mechanism for doing the
same thing which, well, is pretty much like our stick above, except it's
a pretty big stick, so it has decent chances of making a difference.

Also, in case you're running trunk: that is unstable code. For more stable
code, use the stable branch (details on the Swift download page).