[Swift-devel] Re: Fault tolerance in "many task computing"?
Michael Wilde
wilde at mcs.anl.gov
Mon Mar 2 18:10:11 CST 2009
On 3/2/09 5:50 PM, Mihael Hategan wrote:
> Is there a Java library for FTB?
No, my understanding is that it's C only at the moment.
>
> What does FTB bring new to the table compared to a distributed messaging
> system?
Pete and Rinku (and a bit of reading) can certainly make a better case,
but this is my general impression:
To me, it seems simple, lightweight, and well-structured for pub-sub of
messages that pertain to system/application operation. I think it
defines a nice model of endpoints, priorities, message codes, etc. while
leaving a payload for the user to send message-specific details.
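A rough illustration of that model, as a minimal in-process sketch in Python. This is not the FTB API (which is C only); the names Event, Backplane, subscribe, and publish, the severity levels, and the example namespace and message code are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical severity levels; the real FTB priorities may differ.
SEVERITIES = ("info", "warning", "fatal")


@dataclass(frozen=True)
class Event:
    namespace: str  # publishing endpoint, e.g. "swift.wrapper" (made-up name)
    code: str       # short message code, e.g. "JOB_FAILED" (made-up name)
    severity: str   # one of SEVERITIES
    payload: str    # free-form, message-specific details


class Backplane:
    """Toy in-process stand-in for a pub-sub backplane."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Callable[[Event], None]]] = {
            severity: [] for severity in SEVERITIES
        }

    def subscribe(self, severity: str, callback: Callable[[Event], None]) -> None:
        """Register a callback for all events of the given severity."""
        if severity not in self._subs:
            raise ValueError(f"unknown severity: {severity}")
        self._subs[severity].append(callback)

    def publish(self, event: Event) -> int:
        """Deliver the event to matching subscribers; return how many got it."""
        if event.severity not in self._subs:
            raise ValueError(f"unknown severity: {event.severity}")
        callbacks = self._subs[event.severity]
        for callback in callbacks:
            callback(event)
        return len(callbacks)


if __name__ == "__main__":
    backplane = Backplane()
    backplane.subscribe("fatal", lambda e: print(e.namespace, e.code, e.payload))
    backplane.publish(Event("swift.wrapper", "JOB_FAILED", "fatal", "exit code 1"))
```

The point of the split is that the fixed fields (endpoint, code, severity) are what subscribers filter on, while the payload stays opaque to the backplane.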
Its agents implement a spanning tree to route messages from distributed
components, so the user doesn't need to worry about this. I think it has
some redundancy in this delivery model.
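To show what spanning-tree routing means here, a small Python sketch follows. It only illustrates the general idea; how FTB agents actually build or repair their tree is not described in this thread, and the topology and the agent names below are made up.

```python
from collections import deque
from typing import Dict, List, Optional


def spanning_tree(links: Dict[str, List[str]], root: str) -> Dict[str, Optional[str]]:
    """Breadth-first spanning tree: map each reachable agent to its parent."""
    parent: Dict[str, Optional[str]] = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent


def route_to_root(parent: Dict[str, Optional[str]], source: str) -> List[str]:
    """Return the hops a message takes from source up to the tree root."""
    if source not in parent:
        raise KeyError(f"agent not in tree: {source}")
    path = [source]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


# Hypothetical agent connectivity (who can talk to whom).
LINKS = {
    "login": ["io-node-1", "io-node-2"],
    "io-node-1": ["login", "worker-a", "worker-b"],
    "io-node-2": ["login", "worker-b"],
    "worker-a": ["io-node-1"],
    "worker-b": ["io-node-1", "io-node-2"],
}

if __name__ == "__main__":
    tree = spanning_tree(LINKS, "login")
    print(route_to_root(tree, "worker-b"))
```

In this made-up topology worker-b has a second link (to io-node-2) that the tree does not use. A spare link of that kind is what a redundant delivery scheme could fall back on, though whether FTB does so is not established here.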
It seems to be designed to be lightweight to handle high traffic (e.g.
from errant system components).
Just seems well-tailored to the log message routing job.
- Mike
>
> Mihael
>
> On Mon, 2009-03-02 at 17:26 -0600, Michael Wilde wrote:
>> All,
>>
>> Pete suggested we take a look at CIFTS's message logging system and
>> consider integrating it into our stack. Rinku gave me, Allan, and Zhao
>> an excellent overview and demo of the system. (Thanks, Rinku!)
>>
>> Here are my notes from this meeting. My intent is just to start a
>> discussion for longer-term consideration, not any near-term action.
>> (Although Jing Tie may find some of these concepts fruitful for er
>> troubleshooting research).
>>
>> CIFTS is the DOE SciDAC project "Coordinated and Improved Fault
>> Tolerance for High Performance Computing Systems", PI'd by Pete:
>> http://www.mcs.anl.gov/research/cifts/index.php
>>
>> It produces "FTB", a backplane for distributing logging information
>> within a distributed system:
>>
>> http://www.mcs.anl.gov/research/cifts/docs/files/ftb_developers_guide.pdf
>>
>> I pointed Rinku to Swift and Falkon info, as well as Netlogger and
>> activities related to it in the CEDPS project, and we have a joint
>> action item to understand the possible overlap and integration issues
>> and possibilities between these two systems.
>>
>> Netlogger and CEDPS info is at:
>>
>> http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page
>> http://dev.globus.org/wiki/Incubator/NetLogger
>> http://www.cedps.net/index.php/Troubleshooting#Work-in-progress
>>
>> I mentioned that we have invested a small bit of effort in integrating
>> Netlogger log publishing capabilities into Swift.
>>
>> Potential overlap notwithstanding, CIFTS (and in particular the Fault
>> Tolerant Backplane, FTB), could serve as a very nice consolidation
>> service for log information originating in the many different components
>> involved in executing a Swift program:
>>
>> - the application program wrapper script
>> - the Falkon or Coaster worker agent
>> - the Globus job manager and/or local scheduler
>> - the worker node
>> - the remote site fileserver/filesystem
>> - a site system management facility like BG/P's RAS service
>> - Falkon and Coaster servers and bootstrappers
>> - the swift client-side engine
>> - GridFTP and other transport protocols and services
>> - etc
>>
>> FTB would enable us to readily capture and consolidate all these
>> information sources and funnel the data into streams related to specific
>> Swift program executions. It has the infrastructure to route messages
>> out of distributed systems, and to permit publication of and
>> subscription to message streams. Its agents, it seems, can help messages
>> traverse firewalls and deal with other transport and delivery issues.
>>
>> FTB is implemented as a C API, and comes with a set of example clients.
>> From this a simple set of command line interfaces could be derived to
>> permit low-cost experimentation with the system in, e.g., Falkon on the
>> BG/P, where Rinku and others are implementing collectors to gather log
>> information from different parts of ZeptoOS and the BG/P hardware complex.
>>
>> It's not clear that any of us have the cycles within the next two months
>> to explore this, but it would make an interesting student project, to
>> compare CIFTS and NetLogger, and to test some initial integrations into
>> Swift, Falkon, and Coasters. (I feel it's a good Summer of Code project).
>>
>> My initial question is whether some CIFTS/FTB hooks could be planted in
>> a lightweight Swift experiment, and we could try to get a feel for
>> whether the infrastructure gives us something that we can't readily get
>> today. My gut feel is that it does.
>>
>> I think it would be a great research/development topic to explore how
>> close this could bring us to the point where all distributed errors are
>> cleanly routed back to the centralized user to more quickly pinpoint the
>> cause of remote and distributed failures. Swift does a *pretty* good
>> job of this today, albeit in a somewhat ad-hoc fashion. FTB would make
>> it easier to integrate information from additional sources like the
>> remote scheduler and BG/P RAS logs into the debugging process.
>>
>> And all that is before we even consider the goals of automating fault
>> tolerance, which I think is the ultimate vision of CIFTS.
>>
>> Thoughts and discussion welcome. Once any of us get a day or so to play
>> with FTB, we'll know more about the possibilities.
>>
>> Regards,
>>
>> Mike
>>
>>
>> On 3/1/09 11:11 AM, Ioan Raicu wrote:
>>> Hi Rinku,
>>> It looks like I am not going to be able to make the meeting tomorrow. On
>>> Friday, another interview opportunity came up, and the only open slot
>>> for the next 2 weeks was this Monday. Sorry about the short notice. Go
>>> ahead and meet without me, and I'll catch up with what was discussed at
>>> the meeting from Mike.
>>>
>>> Thanks,
>>> Ioan
>>>
>>> Michael Wilde wrote:
>>>> Zhao, Allan, you're welcome to join this discussion, at Argonne Mar 2,
>>>> or by phone.
>>>>
>>>> - Mike
>>>>
>>>>
>>>> On 2/18/09 10:30 PM, Ian Foster wrote:
>>>>> Hi,
>>>>>
>>>>> This sounds like a really fun project. Maybe we should involve Zhao
>>>>> and Allan as well, given that Ioan has (sadly) graduated, and will
>>>>> leave us?
>>>>> I'd love to participate, I will need to do so by phone--could we do
>>>>> that? I'll just listen in, and see what I can learn.
>>>>>
>>>>> Ian.
>>>>>
>>>>>
>>>>> On Feb 18, 2009, at 8:46 PM, Ioan Raicu wrote:
>>>>>
>>>>>> Great!
>>>>>>
>>>>>> I added Ian as a cc, maybe he wants to come to this meeting as well.
>>>>>> Ian, the original message from Pete was:
>>>>>>> Ioan and Mike,
>>>>>>>
>>>>>>> The CIFTS project is a DOE project to provide a "fault tolerant
>>>>>>> backplane". I'm the PI of the project which involved ORNL, LBL,
>>>>>>> IU, Ohio State, and UTK. Below is a suggestion to hook CIFTS to
>>>>>>> Falkon, so faults could be monitored. Rinku (on the cc: line) is
>>>>>>> the lead developer for CIFTS. Maybe when one of you is on campus
>>>>>>> (ANL) you can meet with Rinku, and brainstorm if there is any way
>>>>>>> to link the two systems efficiently. Email below is from an ORNL
>>>>>>> participant in the CIFTS framework.
>>>>>>>
>>>>>>> -Pete
>>>>>> The meeting is scheduled with Rinku, Mike, Pete (?), and I for March
>>>>>> 2nd, at 11AM in Rinku's office (ANL, D-231 in the MCS building).
>>>>>>
>>>>>> Ioan
>>>>>>
>>>>>> Rinku Gupta wrote:
>>>>>>> We can meet at my office (D-231 in the MCS building) and then sneak
>>>>>>> into Pete's room, if it is empty.
>>>>>>>
>>>>>>> Rinku
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- "Ioan Raicu" <iraicu at cs.uchicago.edu> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Works for me! I assume we are meeting at ANL. Whose office are we
>>>>>>>> meeting in?
>>>>>>>>
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Rinku Gupta wrote:
>>>>>>>>
>>>>>>>> Based on everyone's availability, how does 11:00am on March 2nd sound?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Rinku
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- "Michael Wilde" <wilde at mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>> Rinku, Ioan,
>>>>>>>>
>>>>>>>> I can do Thu Feb 26 10:30 (I assume you meant AM not PM).
>>>>>>>>
>>>>>>>> But if Rinku is just arriving back in the US that morning, it seems
>>>>>>>> better to postpone to the week after.
>>>>>>>>
>>>>>>>> I can be at Argonne any time week of March 2. Mornings are free,
>>>>>>>> Mon-Thu are best.
>>>>>>>>
>>>>>>>> Can we tentatively then meet at 11AM Mon Mar 2?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/18/09 9:37 AM, Ioan Raicu wrote:
>>>>>>>>
>>>>>>>> Hi Rinku,
>>>>>>>> Next Thursday (February 26th) at 10:30AM would work for me. If we
>>>>>>>> need to meet the following week, I could meet Monday (March 2nd) and
>>>>>>>> Thursday (March 5th) any time.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Rinku Gupta wrote:
>>>>>>>>
>>>>>>>> Hi Michael, Ioan
>>>>>>>>
>>>>>>>> I am currently on travel and will arrive back to the USA only
>>>>>>>> Thursday (Feb 26th) early morning. Will you be available anytime the
>>>>>>>> week after next? If not, then we can try to schedule a meeting
>>>>>>>> sometime around 10:30/11pm next Thursday at ANL.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Rinku
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- "Ioan Raicu" <iraicu at cs.uchicago.edu> wrote:
>>>>>>>>
>>>>>>>> Hi Rinku,
>>>>>>>> I can meet next week on Wednesday any time, and Thursday morning
>>>>>>>> before noon, as I have a flight to catch early afternoon from O'Hare.
>>>>>>>> I can meet either at UC or ANL. Let me know what works best for everyone.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Michael Wilde wrote:
>>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Rinku, let's set up a meeting for next week to discuss. I can meet
>>>>>>>> Wed or Thu, at Argonne or UChicago.
>>>>>>>>
>>>>>>>> Do either of those dates work for you, and which place is best?
>>>>>>>>
>>>>>>>> In the meantime I'll read up on CIFTS at
>>>>>>>> http://www.mcs.anl.gov/research/cifts/docs/index.php and the wiki
>>>>>>>> that this refers to.
>>>>>>>>
>>>>>>>> If you have any other docs we should read, please send them.
>>>>>>>>
>>>>>>>> Thanks and regards,
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/17/09 12:42 PM, Pete Beckman wrote:
>>>>>>>>
>>>>>>>> Ioan and Mike,
>>>>>>>>
>>>>>>>> The CIFTS project is a DOE project to provide a "fault tolerant
>>>>>>>> backplane". I'm the PI of the project which involved ORNL, LBL, IU,
>>>>>>>> Ohio State, and UTK. Below is a suggestion to hook CIFTS to Falkon,
>>>>>>>> so faults could be monitored. Rinku (on the cc: line) is the lead
>>>>>>>> developer for CIFTS. Maybe when one of you is on campus (ANL) you
>>>>>>>> can meet with Rinku, and brainstorm if there is any way to link the
>>>>>>>> two systems efficiently. Email below is from an ORNL participant in
>>>>>>>> the CIFTS framework.
>>>>>>>>
>>>>>>>> -Pete
>>>>>>>>
>>>>>>>>
>>>>>>>> Begin forwarded message:
>>>>>>>>
>>>>>>>> From: bernholdtde at ornl.gov
>>>>>>>> Date: February 12, 2009 10:29:47 AM CST
>>>>>>>> To: cifts at googlegroups.com
>>>>>>>> Cc: bernholdtde at ornl.gov
>>>>>>>> Subject: Fault tolerance in "many task computing"?
>>>>>>>> Reply-To: cifts at googlegroups.com
>>>>>>>>
>>>>>>>> Pete (and other ANL folks),
>>>>>>>>
>>>>>>>> I recently read the SC08 paper on many task computing on which you're
>>>>>>>> a co-author. ( http://portal.acm.org/citation.cfm?doid=1413370.1413393 )
>>>>>>>>
>>>>>>>> I wonder if it would be viable to build a CIFTS demonstration scenario
>>>>>>>> around the software system described in this paper?
>>>>>>>>
>>>>>>>> In the paper, there's a paragraph discussing reliability that
>>>>>>>> discusses some of the issues at a high level. It strikes me as both
>>>>>>>> interesting and challenging because you have both system components
>>>>>>>> (i.e. Cobalt) and multiple user-space components (Falkon, Swift,
>>>>>>>> application tasks) interacting.
>>>>>>>>
>>>>>>>> It might also be worth looking at this environment to help understand
>>>>>>>> the use cases and requirements for the policy/control channels (as
>>>>>>>> opposed to the FTB's informational channel).
>>>>>>>>
>>>>>>>> Just some ideas, db
>>>>>>>> --
>>>>>>>> David E. Bernholdt | Email: bernholdtde at ornl.gov
>>>>>>>> Oak Ridge National Laboratory | Phone: +1 (865) 574 3147
>>>>>>>> http://www.csm.ornl.gov/~bernhold/ | Fax: +1 (865) 576 5491
>>>>>>>>
>>>>>>>> --~--~---------~--~----~------------~-------~--~----~
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "CIFTS" group.
>>>>>>>> To post to this group, send email to cifts at googlegroups.com
>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>> cifts+unsubscribe at googlegroups.com
>>>>>>>> For more options, visit this group at
>>>>>>>> http://groups.google.com/group/cifts?hl=en
>>>>>>>> -~----------~----~----~----~------~----~------~--~---
>>>>>>>> --
>>>>>>>> ===================================================
>>>>>>>> Ioan Raicu, Ph.D.
>>>>>>>> ===================================================
>>>>>>>> Distributed Systems Laboratory
>>>>>>>> Computer Science Department
>>>>>>>> University of Chicago
>>>>>>>> 1100 E. 58th Street, Ryerson Hall
>>>>>>>> Chicago, IL 60637
>>>>>>>> ===================================================
>>>>>>>> Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu
>>>>>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>>>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>>>>>>> ===================================================
>>>>>>>> ===================================================
>>>>>>>>
>>>>>>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>