[Swift-user] Re: Errors in 13-site OSG run: lazy error question

Mihael Hategan hategan at mcs.anl.gov
Fri Aug 27 11:41:07 CDT 2010


Or even the log itself, because I don't think I have access to
engage-submit.

On Fri, 2010-08-27 at 11:34 -0500, Mihael Hategan wrote:
> Or if you can find the stack trace of that specific error in the log,
> that might be useful.
> 
> On Fri, 2010-08-27 at 09:06 -0600, Michael Wilde wrote:
> > Glen, as I recall, in the previous incident of this error we re-created it with a simpler script, using only the "cat" app(), correct?
> > 
> > Is it possible to re-create this error in a similar test script?
> > 
> > Mihael, any thoughts on whether it's likely that the prior fix did not address all cases?
> > 
> > Thanks,
> > 
> > - Mike
> > 
> > 
> > ----- "Glen Hocky" <hockyg at gmail.com> wrote:
> > 
> > > Yes, nominally the same error, but it's not at the beginning but in the
> > > middle now for some reason. I think it's a misstated error message.
> > > I'll attach the log soon.
> > > 
> > > On Aug 27, 2010, at 12:11 AM, Michael Wilde <wilde at mcs.anl.gov>
> > > wrote:
> > > 
> > > > Glen, I wonder if what's happening here is that Swift will retry and
> > > > lazily run past *job* errors, but the error below (a mapping error) is
> > > > maybe being treated as an error in Swift's interpretation of the
> > > > script itself, and this causes an immediate halt to execution?
> > > >
> > > > Can anyone confirm that this is what's happening, and if it is the
> > > > expected behavior?
> > > >
> > > > Also, Glen, 2 questions:
> > > >
> > > > 1) Isn't the error below the one that was fixed by Mihael in a
> > > > recent revision - the same one I looked at earlier in the week?
> > > >
> > > > 2) Do you know what errors the "Failed but can retry:8" message is
> > > > referring to?
> > > >
> > > > Where is the log/run directory for this run?  How long did it take
> > > > to get the 589 jobs finished?  It would be good to start plotting
> > > > these large multi-site runs to get a sense of how the scheduler is
> > > > doing.
> > > >
> > > > - Mike
> > > >
> > > >
> > > > ----- "Glen Hocky" <hockyg at uchicago.edu> wrote:
> > > >
> > > >> here's the result of my 13 site run that ran while i was out this
> > > >> evening. It did pretty well!
> > > >> but seems to have that problem of not quite lazy errors
> > > >> ........
> > > > >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3 Stage out:1 Finished successfully:586
> > > > >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4 Stage out:2 Finished successfully:587
> > > > >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished successfully:587 Failed but can retry:6
> > > > >> Progress: Submitting:3 Submitted:262 Active:140 Finished successfully:589 Failed but can retry:8
> > > > >> Failed to transfer wrapper log from glassRunCavities-20100826-1718-7gi0dzs1/info/5 on UCHC_CBG_vdgateway.vcell.uchc.edu
> > > > >> Execution failed:
> > > > >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile) for org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut with no value at dataset=modelOut path=[3][1][11] (not closed)
> > > >
> > > > --
> > > > Michael Wilde
> > > > Computation Institute, University of Chicago
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > >
> > 
> 
> 




