[Swift-devel] [Swift-user] gram on ranger

Michael Wilde wilde at mcs.anl.gov
Tue Nov 8 21:05:08 CST 2011


Sarah, David, All,

I've just tried to review the lengthy email conversation on this problem (or set of problems), and I found it very hard to discern how many different problem symptoms are involved.

I filed bug 593 to cover the issue that you reported on Oct 11. I know that David and Ketan both worked on this, and David made some changes to the provider, but there have been no updates to that ticket to help understand where things stand.

Also, there have been several fixes to 0.93 in the past week, but at the same time Ketan has encountered new SGE provider issues (with MPI jobs, but perhaps applicable to all jobs) with the latest code, and I just filed those as bug 624.

David and Ketan, please work together to discuss the status, review the specifics of Sarah's issue, and see if she is running with the latest 0.93 revisions. David, since you made the last set of changes to the SGE provider, can you take ownership of this, and update SGE with comments and/or new tickets?

Also, can you try to create a test case that can replicate Sarah's problem?

It seems to me that this problem/thread started off as "some jobs at the end of a long workflow don't complete", then changed to "SGE is rejecting my workflow at some point with no explanation", and now we're back to the "last job doesn't complete" symptom.

Let's get the symptoms clearly identified, matched with the logs and with the Swift+CoG revisions being used, and then matched against what else we suspect is still broken in the SGE provider.

Further, we are not doing any GRAM-SGE testing to my knowledge, yet that's what Sarah is using, so we should add that to the test cases for SGE. Perhaps we should discuss with Sarah whether a Coaster-SSH-SGE config would help us get in sync and make use of Ranger more reliable, at least until we have time to test the GRAM case.

Thanks,

- Mike


----- Original Message -----
> From: "Sarah Kenny" <skenny at uchicago.edu>
> To: "Justin M Wozniak" <wozniak at mcs.anl.gov>
> Cc: "Anjali Raja" <anjraja at gmail.com>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Tuesday, November 8, 2011 4:36:42 PM
> Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> thought i'd revisit this since anjali re-ran this workflow with fewer
> jobs (~85K) and perhaps the info would be useful. it showed a similar
> pattern in that it finished all jobs but one (that is, we were missing
> a single output file) and hung indefinitely on the last 'finished
> successfully...'
> 
> so this discussion seems to have turned mostly to how coasters
> requests cores. however, i have to say that *generally* in the past
> when swift/coasters has requested too many cores for the given queue
> gram complains and you see it in the gram log, which is not the case
> here.
> 
> that said, if you want em: the swift log is in /home/skenny/swift_logs
> on ci and the coaster log was too big for my home on ci (and has since
> been appended to so make sure to match the dates with the swift log),
> but if someone has access to ranger it's in /var/tmp/skenny_swift on
> login3
> 
> we're continuing to use the same swift version and sites file since
> it's at least helping us push thru much of the work (doing manual
> resumes/restarts).
> 
> ~sk
> 
> 
> On Fri, Oct 28, 2011 at 11:02 AM, Justin M Wozniak <
> wozniak at mcs.anl.gov > wrote:
> 
> 
> 
> I think count is the number of processes. PBSExecutor uses it, that may
> be a good place to look. In the Coasters context, I think it is the
> number of invocations of worker.pl.
> 
> 
> 
> 
> On Fri, 28 Oct 2011, David Kelly wrote:
> 
> > Just to clarify - when coasters is being used, count represents the
> > number of coaster blocks? Then to get the number of cores to request, I
> > should use count*workersPerNode?
> >
> > What about in the case where coasters is not used?
> >
> > ----- Original Message -----
> >> From: "Mihael Hategan" < hategan at mcs.anl.gov >
> >> To: "David Kelly" < davidk at ci.uchicago.edu >
> >> Cc: "Anjali Raja" < anjraja at gmail.com >, "Swift Devel" <
> >> swift-devel at ci.uchicago.edu >, "Swift User"
> >> < swift-user at ci.uchicago.edu >, "Ketan Maheshwari" <
> >> ketancmaheshwari at gmail.com >
> >> Sent: Thursday, October 20, 2011 9:08:46 PM
> >> Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> >> On Thu, 2011-10-20 at 21:03 -0500, David Kelly wrote:
> >>> Yep, this is using coasters
> >>>
> >>
> >> Then no. Count is whatever the block allocation algorithm decides it
> >> should be.
> >>
> >>>>>
> >>>>> Should count=32 in the second case? Am I misunderstanding what
> >>>>> 'count' is? Is there any way to get the exact number of
> >>>>> applications?
> >>>>
> >>>> Coasters?
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> 
> --
> Justin M Wozniak
> 
> 
> 
> 
> 
> 
> --
> Sarah Kenny
> Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
> University of California Irvine, Dept. of Neurology ~ 773-818-8300
> 
> 

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



