[Swift-devel] Fwd: hang checker updates

Michael Wilde wilde at mcs.anl.gov
Fri Aug 3 23:19:49 CDT 2012


Mihael, good points.  Much of this belongs in the user guide. I'll paste it into a ticket.

- Mike

----- Forwarded Message -----
From: "Mihael Hategan" <hategan at mcs.anl.gov>
To: "Jared M Chase" <jared.chase at pnnl.gov>
Cc: "Karen L Schuchardt" <Karen.Schuchardt at pnnl.gov>, "Khushbu Agarwal" <Khushbu.Agarwal at pnnl.gov>, "Michael Wilde" <wilde at mcs.anl.gov>, "David Kelly" <davidk at ci.uchicago.edu>, "Justin M. Wozniak" <wozniak at mcs.anl.gov>
Sent: Friday, August 3, 2012 8:30:17 PM
Subject: RE: [Swift-devel] hang checker updates

On Fri, 2012-08-03 at 16:36 -0700, Chase, Jared M wrote:
> The 6-hour run ran for around 4 hours until we got an exception.  The
> exception says it is looking for a file under the directory from which
> we are submitting the job (/scratch/scratchdirs/jchase/hybrid) as
> opposed to the work directory
> (/scratch/scratchdirs/jchase/hybrid/work/hw-clean-jared-20120803-1135-uq7t7sae).
> Also, there is another file-not-found exception for one of the info
> directories ...


I'm having some trouble following this. As far as I can tell, the run
should not have gotten as far as it did. Every iteration appears to
require a "run-n/sph-1/sph.output.h5part", but iteration 80 is the last
one for which such a file is produced.

mike@blabla:~/tmp/swift-bugs/i$ cat hw-clean-jared-20120803-1135-uq7t7sae.log | \
    egrep 'FILE_STAGE_OUT_START.*h5part.*sph-1 ' | awk '{print $9}'

So I suspect that you get this far because you have those files from
previous runs.
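
One way to test that suspicion is to compare each expected output's
modification time with the time the current run started. The following
is a minimal Python sketch, not part of Swift. It assumes that iteration
n writes run-<n>/sph-1/sph.output.h5part under some base directory (the
literal directory naming is a guess from the path pattern above), and
the function name and arguments are hypothetical.

```python
from pathlib import Path


def classify_outputs(base_dir, iterations, run_start):
    """Sort iterations by the state of their expected output file.

    Assumption: iteration n writes
    <base_dir>/run-<n>/sph-1/sph.output.h5part.
    run_start is a POSIX timestamp taken when the current run began.
    Returns three lists of iteration numbers: (fresh, stale, missing).
    """
    fresh, stale, missing = [], [], []
    for n in iterations:
        path = Path(base_dir) / f"run-{n}" / "sph-1" / "sph.output.h5part"
        if not path.is_file():
            missing.append(n)      # never produced
        elif path.stat().st_mtime < run_start:
            stale.append(n)        # older than this run: likely a leftover
        else:
            fresh.append(n)        # written during this run
    return fresh, stale, missing
```

Calling it with a timestamp recorded just before launching the run
separates files written by this run from leftovers of earlier ones; any
iteration listed as stale or missing is one whose implicit input was not
actually created this time.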

I'll make a few observations that may not be immediately apparent from
the swift documentation:

1. With the exception of iterate{} (whose current implementation happens
to order iterations sequentially), execution is based purely on data
dependencies. So in the following example, all instances of trace("A")
will run long before the apps run, because they have no dependencies
to wait for:
   foreach i in [1:4] {
     outf[i] = cat(inf);
     trace(outf[i]); // you will see these as the cat apps complete
     trace("A");     // you will see 4 "trace: A" at the beginning
   }
2. In the same spirit, trace("A"); trace("B"); may print either "A" or
"B" first. Nothing enforces a particular sequence, since there is no
dependency between the two. The textual order of statements does not
determine the order of execution.
3. You are using implicit dependencies: in each iteration, you map
files that you expect to exist from the previous iteration. The
"proper" way to do it is to explicitly pass those files (not just names)
as arguments between iterations, although I understand that you may have
done the former to work around a swift bug (which is fixed in the latest
swift package I sent). I wouldn't change this now if you are comfortable
with the current scheme, but you need to make sure that the files
implicitly passed between iterations are actually created.
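
The ordering described in points 1 and 2 can be imitated outside Swift.
The following is a rough Python analogy, not Swift code: cat() is a
stand-in for the cat app, the 0.2 second sleep is an arbitrary
placeholder for job run time, and run_foreach() is a hypothetical name.
Each "app" is submitted to a thread pool; the dependent trace is
attached as a callback that fires when its own app completes, while the
independent trace("A") is recorded immediately.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cat(text):
    """Stand-in for the cat app; the sleep mimics a slow job."""
    time.sleep(0.2)
    return text


def run_foreach():
    """Imitate the foreach body and return trace events in arrival order."""
    events = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(4):
            out = pool.submit(cat, "input")       # outf[i] = cat(inf);
            # trace(outf[i]): depends on the app, so it fires on completion.
            out.add_done_callback(lambda f: events.append(f.result()))
            # trace("A"): no dependency, so it is recorded right away.
            events.append("A")
    # Leaving the with-block waits for all apps and their callbacks.
    return events


if __name__ == "__main__":
    print(run_foreach())
```

Because the apps take far longer than the loop that submits them, the
four "A" entries normally land in the list before any app output, even
though each trace("A") is written after trace(outf[i]) in the loop body.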

> 
> Caused by: org.globus.cog.abstraction.impl.file.FileNotFoundException:
> File not
> found: /scratch/scratchdirs/jchase/hybrid/work/hw-clean-jared-20120803-1135-uq7t7sae/info/j/pg-jp0go4vk-info

Whenever a job fails, swift tries to gather logs related to that job for
post-mortem troubleshooting. Sometimes those logs are not available, and
the failed attempts to transfer them are logged as errors, but those
errors are secondary: they are not the cause of the job failure.

Mihael


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



