[Swift-devel] Re: Fault tolerance in "many task computing"?

Mihael Hategan hategan at mcs.anl.gov
Mon Mar 2 17:50:42 CST 2009


Is there a Java library for FTB?

What does FTB bring new to the table compared to a distributed messaging
system?

Mihael

On Mon, 2009-03-02 at 17:26 -0600, Michael Wilde wrote:
> All,
> 
> Pete suggested we take a look at CIFTS's message logging system and 
> consider integrating it into our stack. Rinku gave me, Allan, and Zhao 
> an excellent overview and demo of the system. (Thanks, Rinku!)
> 
> Here are my notes from this meeting. My intent is just to start a 
> discussion for longer-term consideration, not any near-term action.
> (Although Jing Tie may find some of these concepts fruitful for her 
> troubleshooting research).
> 
> CIFTS is the DOE SciDAC project "Coordinated and Improved Fault 
> Tolerance for High Performance Computing Systems", PI'd by Pete:
> http://www.mcs.anl.gov/research/cifts/index.php
> 
> It produces "FTB", a backplane for distributing logging information 
> within a distributed system:
> 
> http://www.mcs.anl.gov/research/cifts/docs/files/ftb_developers_guide.pdf
> 
> I pointed Rinku to Swift and Falkon info, as well as Netlogger and 
> activities related to it in the CEDPS project, and we have a joint 
> action item to understand the possible overlap and integration issues 
> and possibilities between these two systems.
> 
> Netlogger and CEDPS info is at:
> 
> http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page
> http://dev.globus.org/wiki/Incubator/NetLogger
> http://www.cedps.net/index.php/Troubleshooting#Work-in-progress
> 
> I mentioned that we have invested a small bit of effort in integrating 
> Netlogger log publishing capabilities into Swift.
> 
> Potential overlap notwithstanding, CIFTS (and in particular the Fault 
> Tolerant Backplane, FTB), could serve as a very nice consolidation 
> service for log information originating in the many different components 
> involved in executing a Swift program:
> 
> - the application program wrapper script
> - the Falkon or Coaster worker agent
> - the Globus job manager and/or local scheduler
> - the worker node
> - the remote site fileserver/filesystem
> - a site system management facility like BG/P's RAS service
> - Falkon and Coaster servers and bootstrappers
> - the swift client-side engine
> - GridFTP and other transport protocols and services
> - etc
> 
> FTB would enable us to readily capture and consolidate all these 
> information sources and funnel the data into streams related to specific 
> Swift program executions. It has the infrastructure to route messages 
> out of distributed systems, and to permit publication of and 
> subscription to message streams. Its agents, it seems, can help messages 
> traverse firewalls and deal with other transport and delivery issues.
> 
> FTB is implemented as a C API, and comes with a set of example clients. 
> From this a simple set of command line interfaces could be derived to 
> permit low-cost experimentation with the system in, e.g., Falkon on the 
> BG/P, where Rinku and others are implementing collectors to gather log 
> information from different parts of ZeptoOS and the BG/P hardware complex.
> 
> It's not clear that any of us have the cycles within the next two months 
> to explore this, but it would make an interesting student project, to 
> compare CIFTS and NetLogger, and to test some initial integrations into 
> Swift, Falkon, and Coasters. (I feel it's a good Summer of Code project).
> 
> My initial question is whether some CIFTS/FTB hooks could be planted in 
> a lightweight Swift experiment, and we could try to get a feel for 
> whether the infrastructure gives us something that we can't readily get 
> today.  My gut feel is that it does.
> 
> I think it would be a great research/development topic to explore how 
> close this could bring us to the point where all distributed errors are 
> cleanly routed back to the centralized user to more quickly pinpoint the 
> cause of remote and distributed failures.  Swift does a *pretty* good 
> job of this today, albeit in a somewhat ad-hoc fashion. FTB would make 
> it easier to integrate information from additional sources like the 
> remote scheduler and BG/P RAS logs into the debugging process.
> 
> And all that is before we even consider the goals of automating fault 
> tolerance, which I think is the ultimate vision of CIFTS.
> 
> Thoughts and discussion welcome. Once any of us get a day or so to play 
> with FTB, we'll know more about the possibilities.
> 
> Regards,
> 
> Mike
> 
> 
> On 3/1/09 11:11 AM, Ioan Raicu wrote:
> > Hi Rinku,
> > It looks like I am not going to be able to make the meeting tomorrow. On 
> > Friday, another interview opportunity came up, and the only open slot 
> > for the next 2 weeks was this Monday. Sorry about the short notice. Go 
> > ahead and meet without me, and I'll catch up with what was discussed at 
> > the meeting from Mike.
> > 
> > Thanks,
> > Ioan
> > 
> > Michael Wilde wrote:
> >> Zhao, Allan, you're welcome to join this discussion, at Argonne Mar 2, 
> >> or by phone.
> >>
> >> - Mike
> >>
> >>
> >> On 2/18/09 10:30 PM, Ian Foster wrote:
> >>> Hi,
> >>>
> >>> This sounds like a really fun project. Maybe we should involve Zhao 
> >>> and Allan as well, given that Ioan has (sadly) graduated, and will 
> >>> leave us?
> >>> I'd love to participate, I will need to do so by phone--could we do 
> >>> that? I'll just listen in, and see what I can learn.
> >>>
> >>> Ian.
> >>>
> >>>
> >>> On Feb 18, 2009, at 8:46 PM, Ioan Raicu wrote:
> >>>
> >>>> Great!
> >>>>
> >>>> I added Ian as a cc, maybe he wants to come to this meeting as well. 
> >>>> Ian, the original message from Pete was:
> >>>>> Ioan and Mike,
> >>>>>
> >>>>> The CIFTS project is a DOE project to provide a "fault tolerant 
> >>>>> backplane".  I'm the PI of the project which involved ORNL, LBL, 
> >>>>> IU, Ohio State, and UTK.  Below is a suggestion to hook CIFTS to 
> >>>>> Falkon, so faults could be monitored.  Rinku (on the cc: line) is 
> >>>>> the lead developer for CIFTS.  Maybe when one of you is on campus 
> >>>>> (ANL) you can meet with Rinku, and brainstorm if there is any way 
> >>>>> to link the two systems efficiently.  Email below is from an ORNL 
> >>>>> participant in the CIFTS framework.
> >>>>>
> >>>>> -Pete 
> >>>> The meeting is scheduled with Rinku, Mike, Pete (?), and I for March 
> >>>> 2nd, at 11AM in Rinku's office (ANL, D-231 in the MCS building).
> >>>>
> >>>> Ioan
> >>>>
> >>>> Rinku Gupta wrote:
> >>>>> We can meet at my office (D-231 in the MCS building) and then sneak 
> >>>>> into Pete's room, if it is empty.
> >>>>>
> >>>>> Rinku
> >>>>>
> >>>>>
> >>>>>
> >>>>> ----- "Ioan Raicu" <iraicu at cs.uchicago.edu> wrote:
> >>>>>
> >>>>>  
> >>>>>> Works for me! I assume we are meeting at ANL. Whose office are we
> >>>>>> meeting in?
> >>>>>>
> >>>>>> Ioan
> >>>>>>
> >>>>>> Rinku Gupta wrote:
> >>>>>>
> >>>>>> Based on everyone's availability, how does 11:00am on March 2nd sound?
> >>>>>>
> >>>>>> Thanks
> >>>>>> Rinku
> >>>>>>
> >>>>>>
> >>>>>> ----- "Michael Wilde" <wilde at mcs.anl.gov> wrote:
> >>>>>>
> >>>>>> Rinku, Ioan,
> >>>>>>
> >>>>>> I can do Thu Feb 26 10:30 (I assume you meant AM not PM).
> >>>>>>
> >>>>>> But if Rinku is just arriving back in the US that morning, it seems
> >>>>>> better to postpone to the week after.
> >>>>>>
> >>>>>> I can be at Argonne any time week of March 2. Mornings are free,
> >>>>>> Mon-Thu
> >>>>>> are best.
> >>>>>>
> >>>>>> Can we tentatively then meet at 11AM Mon Mar 2?
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Mike
> >>>>>>
> >>>>>>
> >>>>>> On 2/18/09 9:37 AM, Ioan Raicu wrote:
> >>>>>>
> >>>>>> Hi Rinku,
> >>>>>> Next Thursday (February 26th) at 10:30AM would work for me. If we need
> >>>>>> to meet the following week, I could meet Monday (March 2nd) and Thursday
> >>>>>> (March 5th) any time.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Ioan
> >>>>>>
> >>>>>> Rinku Gupta wrote:
> >>>>>>
> >>>>>> Hi Michael,  Ioan
> >>>>>>
> >>>>>> I am currently on travel and will arrive back to the USA only Thursday
> >>>>>> (Feb 26th) early morning. Will you be available anytime the
> >>>>>> week after next? If not, then we can try to schedule a meeting
> >>>>>> sometime around 10:30/11pm next Thursday at ANL.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Rinku
> >>>>>>
> >>>>>>
> >>>>>> ----- "Ioan Raicu" <iraicu at cs.uchicago.edu> wrote:
> >>>>>>
> >>>>>> Hi Rinku,
> >>>>>> I can meet next week on Wednesday any time, and Thursday morning before
> >>>>>> noon, as I have a flight to catch early afternoon from O'Hare. I can
> >>>>>> meet either at UC or ANL. Let me know what works best for everyone.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ioan
> >>>>>>
> >>>>>> Michael Wilde wrote:
> >>>>>>
> >>>>>> Hi All,
> >>>>>>
> >>>>>> Rinku, let's set up a meeting for next week to discuss. I can meet Wed
> >>>>>> or Thu, at Argonne or UChicago.
> >>>>>>
> >>>>>> Do either of those dates work for you, and which place is best?
> >>>>>>
> >>>>>> In the meantime I'll read up on CIFTS at
> >>>>>> http://www.mcs.anl.gov/research/cifts/docs/index.php and the wiki that
> >>>>>> this refers to.
> >>>>>>
> >>>>>> If you have any other docs we should read, please send them.
> >>>>>>
> >>>>>> Thanks and regards,
> >>>>>>
> >>>>>> Mike
> >>>>>>
> >>>>>>
> >>>>>> On 2/17/09 12:42 PM, Pete Beckman wrote:
> >>>>>>
> >>>>>> Ioan and Mike,
> >>>>>>
> >>>>>> The CIFTS project is a DOE project to provide a "fault tolerant
> >>>>>> backplane".  I'm the PI of the project which involved ORNL, LBL, IU,
> >>>>>> Ohio State, and UTK.  Below is a suggestion to hook CIFTS to Falkon,
> >>>>>> so faults could be monitored.  Rinku (on the cc: line) is the lead
> >>>>>> developer for CIFTS.  Maybe when one of you is on campus (ANL) you
> >>>>>> can meet with Rinku, and brainstorm if there is any way to link the
> >>>>>> two systems efficiently.  Email below is from an ORNL participant in
> >>>>>> the CIFTS framework.
> >>>>>>
> >>>>>> -Pete
> >>>>>>
> >>>>>>
> >>>>>> Begin forwarded message:
> >>>>>>
> >>>>>> From: bernholdtde at ornl.gov
> >>>>>> Date: February 12, 2009 10:29:47 AM CST
> >>>>>> To: cifts at googlegroups.com
> >>>>>> Cc: bernholdtde at ornl.gov
> >>>>>> Subject: Fault tolerance in "many task computing"?
> >>>>>> Reply-To: cifts at googlegroups.com
> >>>>>>
> >>>>>> Pete (and other ANL folks),
> >>>>>>
> >>>>>> I recently read the SC08 paper on many task computing on which you're
> >>>>>> a co-author. ( http://portal.acm.org/citation.cfm?doid=1413370.1413393 )
> >>>>>> I wonder if it would be viable to build a CIFTS demonstration scenario
> >>>>>> around the software system described in this paper?
> >>>>>>
> >>>>>> In the paper, there's a paragraph discussing reliability that
> >>>>>> discusses some of the issues at a high level.  It strikes me as both
> >>>>>> interesting and challenging because you have both system components
> >>>>>> (i.e. Cobalt) and multiple user-space components (Falkon, Swift,
> >>>>>> application tasks) interacting.
> >>>>>>
> >>>>>> It might also be worth looking at this environment to help understand
> >>>>>> the use cases and requirements for the policy/control channels (as
> >>>>>> opposed to the FTB's informational channel).
> >>>>>>
> >>>>>> Just some ideas, db
> >>>>>> -- 
> >>>>>> David E. Bernholdt                   |   Email: bernholdtde at ornl.gov
> >>>>>> Oak Ridge National Laboratory        |   Phone: +1 (865) 574 3147
> >>>>>> http://www.csm.ornl.gov/~bernhold/   |   Fax:   +1 (865) 576 5491
> >>>>>>
> >>>>>> --
> >>>>>> ===================================================
> >>>>>> Ioan Raicu, Ph.D.
> >>>>>> ===================================================
> >>>>>> Distributed Systems Laboratory
> >>>>>> Computer Science Department
> >>>>>> University of Chicago
> >>>>>> 1100 E. 58th Street, Ryerson Hall
> >>>>>> Chicago, IL 60637
> >>>>>> ===================================================
> >>>>>> Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu
> >>>>>> http://dev.globus.org/wiki/Incubator/Falkon
> >>>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> >>>>>> ===================================================
> >>>>>> ===================================================
> >>>>>>     
> >>>>>   
> >>>>
> >>>
> >>
> > 

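The publish/subscribe consolidation described in Mike's notes (many components emitting log events, with a subscriber collecting only those that belong to one Swift program execution) can be illustrated with a small in-memory sketch. This is not the FTB C API and makes no claim about how FTB routes messages; the names Event, Backplane, collect, and the "run-42" style run identifiers are all hypothetical, chosen only to show the idea.

```python
from collections import defaultdict
from typing import Callable, List, NamedTuple


class Event(NamedTuple):
    run_id: str    # hypothetical identifier of one Swift program execution
    source: str    # component that emitted the event, e.g. "wrapper"
    severity: str  # "info", "warning", or "error"
    message: str


class Backplane:
    """Toy in-memory stand-in for a publish/subscribe event backplane."""

    def __init__(self) -> None:
        # Maps a run identifier to the callbacks subscribed to that run.
        self._subscribers = defaultdict(list)

    def subscribe(self, run_id: str, callback: Callable[[Event], None]) -> None:
        """Register a callback for every event tagged with run_id."""
        self._subscribers[run_id].append(callback)

    def publish(self, event: Event) -> None:
        """Deliver the event to the subscribers of its run; drop it otherwise."""
        for callback in self._subscribers.get(event.run_id, []):
            callback(event)


def collect(backplane: Backplane, run_id: str) -> List[Event]:
    """Return a list that fills up with the events published for run_id."""
    events: List[Event] = []
    backplane.subscribe(run_id, events.append)
    return events


if __name__ == "__main__":
    bp = Backplane()
    run42 = collect(bp, "run-42")
    # Different components publish independently; only run-42 events are kept.
    bp.publish(Event("run-42", "wrapper", "error", "application exited with code 1"))
    bp.publish(Event("run-43", "coaster-worker", "info", "worker started"))
    bp.publish(Event("run-42", "gridftp", "warning", "transfer retried"))
    for event in run42:
        print(event.source, event.severity, event.message)
```

A real backplane would also have to move events across hosts and firewalls, which is the part the notes attribute to FTB's agents; the sketch covers only the per-run subscription.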


