[Swift-devel] Fwd: [Swift-user] Re: Errors in 13-site OSG run: lazy error question

Michael Wilde wilde at mcs.anl.gov
Sun Aug 29 20:17:13 CDT 2010


Glen, "thing 1" below might be as simple as having a universal front-end command like swiftrun track the initial args to swift in a local file, so that restart is easier.

But I guess the command-line arguments could or should be saved in the restart file.

Both sound like projects that David could take on. For now, let's make your front-end wrapper save a swift.cmd.args file or something like that, for restart.
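
A minimal sketch of what such a wrapper could look like, assuming Python is available on the submit host. The file name swift.cmd.args comes from the suggestion above; the "--restart" flag and the function names are hypothetical and belong to this wrapper only, not to swift itself.

```python
import json
import subprocess
from pathlib import Path

ARGS_FILE = Path("swift.cmd.args")  # file name suggested above


def save_args(args, path=ARGS_FILE):
    """Record the original swift arguments as a JSON list."""
    path.write_text(json.dumps(list(args)))


def load_args(path=ARGS_FILE):
    """Return the arguments saved by a previous run."""
    return json.loads(path.read_text())


def build_command(argv, path=ARGS_FILE):
    """Return the swift command line for a fresh run or a restart.

    "--restart" is a hypothetical flag of this wrapper, not a swift option.
    """
    if argv and argv[0] == "--restart":
        # Reuse the saved arguments; anything after --restart (for example
        # restart-specific options) is appended unchanged.
        return ["swift"] + load_args(path) + list(argv[1:])
    save_args(argv, path)
    return ["swift"] + list(argv)


def main(argv):
    """Run swift with the assembled command line and return its exit code."""
    return subprocess.call(build_command(argv))
```

A first run would call main() with the full argument list; a later restart would pass only "--restart" plus whatever extra options the restart needs, so the original options do not have to be retyped.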

- Mike


----- Forwarded Message -----
From: "Glen Hocky" <hockyg at gmail.com>
To: "Michael Wilde" <wilde at mcs.anl.gov>
Sent: Sunday, August 29, 2010 8:11:15 PM GMT -06:00 US/Canada Central
Subject: Re: [Swift-user] Re: Errors in 13-site OSG run: lazy error question

oh but two things for the devels that we discussed before 
1) if you could get someone to make restarting slightly easier (i.e. you don't have to specify all options to restart, see earlier email to list host) 
2) tagging the jobs submitted or at least making sure they get pulled out when a job fails or is canceled with the condor provider 
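
One way item 2 could be approached, sketched in Python under stated assumptions: Condor lets a submit description carry custom job attributes (lines starting with "+"), and condor_rm accepts a -constraint expression. The attribute name SwiftRunId is hypothetical; nothing here reflects what Swift's condor provider actually does.

```python
def tag_line(run_id):
    """Submit-description line that adds a custom attribute to a job.

    SwiftRunId is a made-up attribute name used only for illustration.
    """
    return f'+SwiftRunId = "{run_id}"'


def removal_command(run_id):
    """condor_rm invocation that removes every queued job carrying the tag."""
    return ["condor_rm", "-constraint", f'SwiftRunId == "{run_id}"']


# Example with the run directory name that appears in the log quoted below.
run_id = "glassRunCavities-20100826-1718-7gi0dzs1"
print(tag_line(run_id))
print(" ".join(removal_command(run_id)))
```

With a tag like this, a failed or canceled run could be cleaned up with one command instead of hunting for individual job ids.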


On Sun, Aug 29, 2010 at 9:08 PM, Glen Hocky < hockyg at gmail.com > wrote: 


well 2 sites that would be productive, *vcell* and *mit* (forget exact names) both have jobs failing with "failed to transfer wrapper log" errors but since it works on so many other sites, i think that must be a problem on those sites...if we could work around or get that fixed that would add a lot of machines. otherwise i'm just gonna try to get some productive runs done (almost done one) so we can say that we used OSG productively.... 





On Sun, Aug 29, 2010 at 8:40 PM, Michael Wilde < wilde at mcs.anl.gov > wrote: 


Very good, thanks Glen. 

What's the next prio on this workflow? Still some sites that are not building or running correctly? 




- Mike 

----- "Glen Hocky" < hockyg at gmail.com > wrote: 

> it works now. thanks a lot 
> 
> 
> On Fri, Aug 27, 2010 at 2:52 PM, Glen Hocky < hockyg at gmail.com > 
> wrote: 
> 
> 
> ok i'll try again 
> 
> 
> 
> 
> 
> On Fri, Aug 27, 2010 at 2:49 PM, Michael Wilde < wilde at mcs.anl.gov > 
> wrote: 
> 
> 
> Updated; ~wilde/swift/rev/trunk is now at: swift-r3571 cog-r2868 
> 
> 
> 
> 
> - Mike 
> 
> ----- "Glen Hocky" < hockyg at gmail.com > wrote: 
> 
> > Let me know when you update... 
> > 
> > 
> > Begin forwarded message: 
> > 
> > 
> > 
> > 
> > From: Mihael Hategan < hategan at mcs.anl.gov > 
> > Date: August 27, 2010 2:01:56 PM EDT 
> > To: Glen Hocky < hockyg at gmail.com > 
> > Cc: Mike Wilde < wilde at mcs.anl.gov > 
> > Subject: Re: [Swift-user] Re: Errors in 13-site OSG run: lazy error 
> > question 
> > 
> > 
> > 
> > 
> > 
> > swift trunk r3568 
> > 
> > On Fri, 2010-08-27 at 13:05 -0400, Glen Hocky wrote:
> >
> > in ci-home:~hockyg/for_mihael
> >
> > On Fri, Aug 27, 2010 at 12:41 PM, Mihael Hategan
> > < hategan at mcs.anl.gov > wrote:
> >
> > Or even the log itself, because I don't think I have access to
> > engage-submit.
> >
> > On Fri, 2010-08-27 at 11:34 -0500, Mihael Hategan wrote:
> >
> > Or if you can find the stack trace of that specific error in the log,
> > that might be useful.
> >
> > On Fri, 2010-08-27 at 09:06 -0600, Michael Wilde wrote:
> >
> > Glen, as I recall, in the previous incident of this error we
> > re-created it with a simpler script, using only the "cat" app(),
> > correct?
> >
> > Is it possible to re-create this similar error in a similar test
> > script?
> >
> > Mihael, any thoughts on whether it's likely that the prior fix did
> > not address all cases?
> >
> > Thanks,
> >
> > - Mike
> >
> > ----- "Glen Hocky" < hockyg at gmail.com > wrote:
> >
> > Yes, nominally the same error, but it's not at the beginning but in
> > the middle now for some reason. I think it's a mis-stated error
> > message.
> >
> > I'll attach the log soon
> >
> > On Aug 27, 2010, at 12:11 AM, Michael Wilde < wilde at mcs.anl.gov >
> > wrote:
> >
> > Glen, I wonder if what's happening here is that Swift will retry and
> > lazily run past *job* errors, but the error below (a mapping error) is
> > maybe being treated as an error in Swift's interpretation of the
> > script itself, and this causes an immediate halt to execution?
> >
> > Can anyone confirm that this is what's happening, and if it is the
> > expected behavior?
> >
> > Also, Glen, 2 questions:
> >
> > 1) Isn't the error below the one that was fixed by Mihael in a
> > recent revision - the same one I looked at earlier in the week?
> >
> > 2) Do you know what errors the "Failed but can retry:8" message is
> > referring to?
> >
> > Where is the log/run directory for this run? How long did it take
> > to get the 589 jobs finished? It would be good to start plotting
> > these large multi-site runs to get a sense of how the scheduler is
> > doing.
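
As a possible starting point for that kind of plotting, here is a small Python sketch that turns one console "Progress:" line into a dictionary of counts. The line format is assumed from the progress lines quoted in this thread only; the function name is hypothetical, and the console lines carry no timestamps, so anything time-based would have to come from the log file instead.

```python
import re

# A state name (letters and spaces) followed by ":" and a count.
_FIELD = re.compile(r"([A-Za-z][A-Za-z ]*):(\d+)")


def parse_progress(line):
    """Return {state: count} for one 'Progress:' line, or None otherwise."""
    if not line.startswith("Progress:"):
        return None
    body = line[len("Progress:"):]
    return {name.strip(): int(count) for name, count in _FIELD.findall(body)}


sample = ("Progress: Submitting:3 Submitted:262 Active:140 "
          "Finished successfully:589 Failed but can retry:8")
counts = parse_progress(sample)
print(counts["Finished successfully"], counts["Failed but can retry"])
```

Collecting these dictionaries in order gives one series per state (Active, Finished successfully, and so on) that can be handed to any plotting tool.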
> >
> > - Mike
> >
> > ----- "Glen Hocky" < hockyg at uchicago.edu > wrote:
> >
> > here's the result of my 13 site run that ran while i was out this
> > evening. It did pretty well!
> > but seems to have that problem of not quite lazy errors
> >
> > ........
> > Progress: Submitting:3 Submitted:262 Active:147 Checking status:3 Stage out:1 Finished successfully:586
> > Progress: Submitting:3 Submitted:262 Active:144 Checking status:4 Stage out:2 Finished successfully:587
> > Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished successfully:587 Failed but can retry:6
> > Progress: Submitting:3 Submitted:262 Active:140 Finished successfully:589 Failed but can retry:8
> > Failed to transfer wrapper log from glassRunCavities-20100826-1718-7gi0dzs1/info/5 on UCHC_CBG_vdgateway.vcell.uchc.edu
> > Execution failed:
> > org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile) for org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut with no value at dataset=modelOut path=[3][1][11] (not closed)
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> > _______________________________________________
> > Swift-user mailing list
> > Swift-user at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> 
-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



