[Swift-user] Re: Errors in 13-site OSG run: lazy error question

Mihael Hategan hategan at mcs.anl.gov
Fri Aug 27 11:34:05 CDT 2010


Or if you can find the stack trace of that specific error in the log,
that might be useful.
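
For anyone searching a large log for that trace, here is a minimal sketch in Python. It is only an illustration, not a Swift tool: the function name `extract_traces` is made up, and it assumes the log prints exceptions in the usual Java layout (the exception line, then indented "at ..." frames, optional "Caused by:" lines, and "... N more").

```python
import re
from typing import Iterable, List

# Lines that continue a Java stack trace: indented "at ..." frames,
# "... N more" summaries, and "Caused by:" chained exceptions.
# Assumption: the log uses the standard Java stack-trace layout.
FRAME = re.compile(r"^\s+(at\s|\.\.\.\s\d+\smore)|^Caused by:")


def extract_traces(lines: Iterable[str], marker: str) -> List[List[str]]:
    """Return each line containing `marker` with the trace lines that follow it."""
    traces: List[List[str]] = []
    current = None  # the trace being collected, if any
    for raw in lines:
        line = raw.rstrip("\n")
        if current is not None and FRAME.search(line):
            current.append(line)  # still inside the current trace
            continue
        if current is not None:
            traces.append(current)  # a non-frame line ends the trace
            current = None
        if marker in line:
            current = [line]  # start a new trace at the marker line
    if current is not None:
        traces.append(current)  # trace ran to the end of the input
    return traces
```

With a log file opened as `f`, something like `extract_traces(f, "InvalidPathException")` would return one list of lines per occurrence, which can then be pasted into a reply.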

On Fri, 2010-08-27 at 09:06 -0600, Michael Wilde wrote:
> Glen, as I recall, in the previous incident of this error we re-created it with a simpler script, using only the "cat" app(), correct?
> 
> Is it possible to re-create this error in a similar test script?
> 
> Mihael, any thoughts on whether it's likely that the prior fix did not address all cases?
> 
> Thanks,
> 
> - Mike
> 
> 
> ----- "Glen Hocky" <hockyg at gmail.com> wrote:
> 
> > Yes, nominally the same error, but it's not at the beginning but in the
> > middle now for some reason. I think it's a misstated error message.
> > I'll attach the log soon.
> > 
> > On Aug 27, 2010, at 12:11 AM, Michael Wilde <wilde at mcs.anl.gov>
> > wrote:
> > 
> > > Glen, I wonder if what's happening here is that Swift will retry and
> > > lazily run past *job* errors, but the error below (a mapping error) is
> > > being treated as an error in Swift's interpretation of the script
> > > itself, which causes an immediate halt to execution?
> > >
> > > Can anyone confirm that this is what's happening, and whether it is
> > > the expected behavior?
> > >
> > > Also, Glen, 2 questions:
> > >
> > > 1) Isn't the error below the one that was fixed by Mihael in a
> > > recent revision - the same one I looked at earlier in the week?
> > >
> > > 2) Do you know what errors the "Failed but can retry:8" message is
> > > referring to?
> > >
> > > Where is the log/run directory for this run?  How long did it take
> > > to get the 589 jobs finished?  It would be good to start plotting
> > > these large multi-site runs to get a sense of how the scheduler is
> > > doing.
> > >
> > > - Mike
> > >
> > >
> > > ----- "Glen Hocky" <hockyg at uchicago.edu> wrote:
> > >
> > >> Here's the result of my 13-site run that ran while I was out this
> > >> evening. It did pretty well!
> > >> But it seems to have that problem of not-quite-lazy errors.
> > >> ........
> > >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3 Stage out:1 Finished successfully:586
> > >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4 Stage out:2 Finished successfully:587
> > >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished successfully:587 Failed but can retry:6
> > >> Progress: Submitting:3 Submitted:262 Active:140 Finished successfully:589 Failed but can retry:8
> > >> Failed to transfer wrapper log from glassRunCavities-20100826-1718-7gi0dzs1/info/5 on UCHC_CBG_vdgateway.vcell.uchc.edu
> > >> Execution failed:
> > >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile) for org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut with no value at dataset=modelOut path=[3][1][11] (not closed)
> > >
> > > --
> > > Michael Wilde
> > > Computation Institute, University of Chicago
> > > Mathematics and Computer Science Division
> > > Argonne National Laboratory
> > >
> 




