[Swift-devel] Re: Fault tolerance in "many task computing"?

Michael Wilde wilde at mcs.anl.gov
Mon Mar 2 18:10:11 CST 2009



On 3/2/09 5:50 PM, Mihael Hategan wrote:
> Is there a Java library for FTB?

No, my understanding is that it's only C at the moment.

> 
> What does FTB bring new to the table compared to a distributed messaging
> system?

Pete and Rinku (and a bit of reading) can certainly make a better case, 
but this is my general impression:

To me, it seems simple, lightweight, and well-structured for pub-sub of 
messages that pertain to system/application operation. I think it 
defines a nice model of endpoints, priorities, message codes, etc. while 
leaving a payload for the user to send message-specific details.
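
To make that model concrete, here is a minimal in-process sketch in Python. 
It is not the FTB API (which is C); the names Event, Bus, subscribe and 
publish, the field names, and the "larger number is more severe" priority 
convention are all hypothetical, chosen only to illustrate pub-sub on a 
message code with a priority filter and a free-form payload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str        # hypothetical endpoint name, e.g. "falkon.worker"
    code: str          # hypothetical message code, e.g. "JOB_FAILED"
    priority: int      # assumption: larger number means more severe
    payload: str = ""  # free-form, message-specific details


class Bus:
    """Toy pub-sub hub: subscribers register per message code."""

    def __init__(self):
        self._subs = {}  # code -> list of (min_priority, handler)

    def subscribe(self, code, min_priority, handler):
        self._subs.setdefault(code, []).append((min_priority, handler))

    def publish(self, event):
        """Deliver to matching subscribers; return how many received it."""
        delivered = 0
        for min_priority, handler in self._subs.get(event.code, ()):
            if event.priority >= min_priority:
                handler(event)
                delivered += 1
        return delivered
```

A subscriber that only wants severe job failures would call 
bus.subscribe("JOB_FAILED", 2, handler) and would never see lower-priority 
events or events with other codes.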

Its agents implement a spanning tree to route messages from distributed 
components, so the user doesn't need to worry about this. I think it has 
some redundancy in this delivery model.
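
As a rough illustration of tree-based routing (a sketch only, not how the 
FTB agents are actually implemented), the Python below forwards a message 
along the edges of a tree of agents, never sending it back over the edge it 
arrived on. The agent names and the TREE layout are made up. Because a tree 
has no cycles, each agent is reached exactly once without any bookkeeping 
of visited nodes.

```python
from collections import deque


def flood(tree, origin):
    """Return agents in the order a message from `origin` reaches them.

    `tree` maps each agent to its neighbours (parent and children).
    """
    reached = [origin]
    queue = deque([(origin, None)])
    while queue:
        agent, came_from = queue.popleft()
        for neighbour in tree[agent]:
            if neighbour != came_from:  # do not echo back to the sender
                reached.append(neighbour)
                queue.append((neighbour, agent))
    return reached


# Hypothetical four-agent tree: root has children a and b; a has child a1.
TREE = {
    "root": ["a", "b"],
    "a": ["root", "a1"],
    "b": ["root"],
    "a1": ["a"],
}
```

Calling flood(TREE, "a1") walks up from the leaf through "a" to "root" and 
then down to "b", which is the sense in which a publisher does not need to 
know where its subscribers are.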

It seems to be designed to be lightweight, to handle high traffic (e.g., 
from errant system components).

Just seems well-tailored to the log message routing job.

- Mike

> 
> Mihael
> 
> On Mon, 2009-03-02 at 17:26 -0600, Michael Wilde wrote:
>> All,
>>
>> Pete suggested we take a look at CIFTS's message logging system and 
>> consider integrating it into our stack. Rinku gave me, Allan, and Zhao 
>> an excellent overview and demo of the system. (Thanks, Rinku!)
>>
>> Here are my notes from this meeting. My intent is just to start a 
>> discussion for longer-term consideration, not any near-term action.
>> (Although Jing Tie may find some of these concepts fruitful for her 
>> troubleshooting research).
>>
>> CIFTS is the DOE SciDAC project "Coordinated and Improved Fault 
>> Tolerance for High Performance Computing Systems", PI'd by Pete:
>> http://www.mcs.anl.gov/research/cifts/index.php
>>
>> It produces "FTB", a backplane for distributing logging information 
>> within a distributed system:
>>
>> http://www.mcs.anl.gov/research/cifts/docs/files/ftb_developers_guide.pdf
>>
>> I pointed Rinku to Swift and Falkon info, as well as Netlogger and 
>> activities related to it in the CEDPS project, and we have a joint 
>> action item to understand the possible overlap and integration issues 
>> and possibilities between these two systems.
>>
>> Netlogger and CEDPS info is at:
>>
>> http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page
>> http://dev.globus.org/wiki/Incubator/NetLogger
>> http://www.cedps.net/index.php/Troubleshooting#Work-in-progress
>>
>> I mentioned that we have invested a small bit of effort in integrating 
>> Netlogger log publishing capabilities into Swift.
>>
>> Potential overlap notwithstanding, CIFTS (and in particular the Fault 
>> Tolerant Backplane, FTB), could serve as a very nice consolidation 
>> service for log information originating in the many different components 
>> involved in executing a Swift program:
>>
>> - the application program wrapper script
>> - the Falkon or Coaster worker agent
>> - the Globus job manager and/or local scheduler
>> - the worker node
>> - the remote site fileserver/filesystem
>> - a site system management facility like BG/P's RAS service
>> - Falkon and Coaster servers and bootstrappers
>> - the swift client-side engine
>> - GridFTP and other transport protocols and services
>> - etc
>>
>> FTB would enable us to readily capture and consolidate all these 
>> information sources and funnel the data into streams related to specific 
>> Swift program executions. It has the infrastructure to route messages 
>> out of distributed systems, and to permit publication of and 
>> subscription to message streams. Its agents, it seems, can help messages 
>> traverse firewalls and deal with other transport and delivery issues.
>>
>> FTB is implemented as a C API, and comes with a set of example clients.
>> From this a simple set of command line interfaces could be derived to
>> permit low-cost experimentation with the system in, e.g., Falkon on the
>> BG/P, where Rinku and others are implementing collectors to gather log
>> information from different parts of ZeptoOS and the BG/P hardware complex.
>>
>> It's not clear that any of us have the cycles within the next two months 
>> to explore this, but it would make an interesting student project, to 
>> compare CIFTS and NetLogger, and to test some initial integrations into 
>> Swift, Falkon, and Coasters. (I feel it's a good Summer of Code project).
>>
>> My initial question is whether some CIFTS/FTB hooks could be planted in 
>> a lightweight Swift experiment, and we could try to get a feel for 
>> whether the infrastructure gives us something that we can't readily get 
>> today.  My gut feel is that it does.
>>
>> I think it would be a great research/development topic to explore how 
>> close this could bring us to the point where all distributed errors are 
>> cleanly routed back to the centralized user to more quickly pinpoint the 
>> cause of remote and distributed failures.  Swift does a *pretty* good 
>> job of this today, albeit in a somewhat ad-hoc fashion. FTB would make 
>> it easier to integrate information from additional sources like the 
>> remote scheduler and BG/P RAS logs into the debugging process.
>>
>> And all that is before we even consider the goals of automating fault 
>> tolerance, which I think is the ultimate vision of CIFTS.
>>
>> Thoughts and discussion welcome. Once any of us gets a day or so to play 
>> with FTB, we'll know more about the possibilities.
>>
>> Regards,
>>
>> Mike
>>
>>
>> On 3/1/09 11:11 AM, Ioan Raicu wrote:
>>> Hi Rinku,
>>> It looks like I am not going to be able to make the meeting tomorrow. On 
>>> Friday, another interview opportunity came up, and the only open slot 
>>> for the next 2 weeks was this Monday. Sorry about the short notice. Go 
>>> ahead and meet without me, and I'll catch up with what was discussed at 
>>> the meeting from Mike.
>>>
>>> Thanks,
>>> Ioan
>>>
>>> Michael Wilde wrote:
>>>> Zhao, Allan, you're welcome to join this discussion, at Argonne Mar 2, 
>>>> or by phone.
>>>>
>>>> - Mike
>>>>
>>>>
>>>> On 2/18/09 10:30 PM, Ian Foster wrote:
>>>>> Hi,
>>>>>
>>>>> This sounds like a really fun project. Maybe we should involve Zhao 
>>>>> and Allan as well, given that Ioan has (sadly) graduated, and will 
>>>>> leave us?
>>>>> I'd love to participate, but I will need to do so by phone. Could we
>>>>> do that? I'll just listen in, and see what I can learn.
>>>>>
>>>>> Ian.
>>>>>
>>>>>
>>>>> On Feb 18, 2009, at 8:46 PM, Ioan Raicu wrote:
>>>>>
>>>>>> Great!
>>>>>>
>>>>>> I added Ian as a cc, maybe he wants to come to this meeting as well. 
>>>>>> Ian, the original message from Pete was:
>>>>>>> Ioan and Mike,
>>>>>>>
>>>>>>> The CIFTS project is a DOE project to provide a "fault tolerant 
>>>>>>> backplane".  I'm the PI of the project which involved ORNL, LBL, 
>>>>>>> IU, Ohio State, and UTK.  Below is a suggestion to hook CIFTS to 
>>>>>>> Falkon, so faults could be monitored.  Rinku (on the cc: line) is 
>>>>>>> the lead developer for CIFTS.  Maybe when one of you is on campus 
>>>>>>> (ANL) you can meet with Rinku, and brainstorm if there is any way 
>>>>>>> to link the two systems efficiently.  Email below is from an ORNL 
>>>>>>> participant in the CIFTS framework.
>>>>>>>
>>>>>>> -Pete 
>>>>>> The meeting is scheduled with Rinku, Mike, Pete (?), and me for March 
>>>>>> 2nd, at 11AM in Rinku's office (ANL, D-231 in the MCS building).
>>>>>>
>>>>>> Ioan
>>>>>>
>>>>>> Rinku Gupta wrote:
>>>>>>> We can meet at my office (D-231 in the MCS building) and then sneak 
>>>>>>> into Pete's room, if it is empty.
>>>>>>>
>>>>>>> Rinku
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- "Ioan Raicu" <iraicu at cs.uchicago.edu> wrote:
>>>>>>>
>>>>>>>  
>>>>>>>> Works for me! I assume we are meeting at ANL. Whose office are we
>>>>>>>> meeting in?
>>>>>>>>
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Rinku Gupta wrote:
>>>>>>>>
>>>>>>>> Based on everyone's availability, how does 11:00am on March 2nd sound?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Rinku
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- "Michael Wilde" <wilde at mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>> Rinku, Ioan,
>>>>>>>>
>>>>>>>> I can do Thu Feb 26 10:30 (I assume you meant AM not PM).
>>>>>>>>
>>>>>>>> But if Rinku is just arriving back in the US that morning, it seems
>>>>>>>> better to postpone to the week after.
>>>>>>>>
>>>>>>>> I can be at Argonne any time the week of March 2. Mornings are free;
>>>>>>>> Mon-Thu are best.
>>>>>>>>
>>>>>>>> Can we tentatively then meet at 11AM Mon Mar 2?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/18/09 9:37 AM, Ioan Raicu wrote:
>>>>>>>>
>>>>>>>> Hi Rinku,
>>>>>>>> Next Thursday (February 26th) at 10:30AM would work for me. If we need
>>>>>>>> to meet the following week, I could meet Monday (March 2nd) and Thursday
>>>>>>>> (March 5th) any time.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Rinku Gupta wrote:
>>>>>>>>
>>>>>>>> Hi Michael, Ioan
>>>>>>>>
>>>>>>>> I am currently traveling and will arrive back in the USA only on
>>>>>>>> Thursday (Feb 26th), early in the morning. Will you be available
>>>>>>>> anytime the week after next? If not, then we can try to schedule a
>>>>>>>> meeting sometime around 10:30/11pm next Thursday at ANL.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Rinku
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- "Ioan Raicu" <iraicu at cs.uchicago.edu> wrote:
>>>>>>>>
>>>>>>>> Hi Rinku,
>>>>>>>> I can meet next week on Wednesday any time, and Thursday morning before
>>>>>>>> noon, as I have a flight to catch early afternoon from O'Hare. I can
>>>>>>>> meet either at UC or ANL. Let me know what works best for everyone.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Michael Wilde wrote:
>>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Rinku, let's set up a meeting for next week to discuss. I can meet Wed
>>>>>>>> or Thu, at Argonne or UChicago.
>>>>>>>>
>>>>>>>> Do either of those dates work for you, and which place is best?
>>>>>>>>
>>>>>>>> In the meantime I'll read up on CIFTS at
>>>>>>>> http://www.mcs.anl.gov/research/cifts/docs/index.php and the wiki that
>>>>>>>> this refers to.
>>>>>>>>
>>>>>>>> If you have any other docs we should read, please send them.
>>>>>>>>
>>>>>>>> Thanks and regards,
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/17/09 12:42 PM, Pete Beckman wrote:
>>>>>>>>
>>>>>>>> Ioan and Mike,
>>>>>>>>
>>>>>>>> The CIFTS project is a DOE project to provide a "fault tolerant
>>>>>>>> backplane".  I'm the PI of the project which involved ORNL, LBL, IU,
>>>>>>>> Ohio State, and UTK.  Below is a suggestion to hook CIFTS to Falkon,
>>>>>>>> so faults could be monitored.  Rinku (on the cc: line) is the lead
>>>>>>>> developer for CIFTS.  Maybe when one of you is on campus (ANL) you
>>>>>>>> can meet with Rinku, and brainstorm if there is any way to link the
>>>>>>>> two systems efficiently.  Email below is from an ORNL participant in
>>>>>>>> the CIFTS framework.
>>>>>>>>
>>>>>>>> -Pete
>>>>>>>>
>>>>>>>>
>>>>>>>> Begin forwarded message:
>>>>>>>>
>>>>>>>> From: bernholdtde at ornl.gov
>>>>>>>> Date: February 12, 2009 10:29:47 AM CST
>>>>>>>> To: cifts at googlegroups.com
>>>>>>>> Cc: bernholdtde at ornl.gov
>>>>>>>> Subject: Fault tolerance in "many task computing"?
>>>>>>>> Reply-To: cifts at googlegroups.com
>>>>>>>>
>>>>>>>> Pete (and other ANL folks),
>>>>>>>>
>>>>>>>> I recently read the SC08 paper on many task computing on which you're
>>>>>>>> a co-author
>>>>>>>> ( http://portal.acm.org/citation.cfm?doid=1413370.1413393 ).
>>>>>>>>
>>>>>>>> I wonder if it would be viable to build a CIFTS demonstration scenario
>>>>>>>> around the software system described in this paper?
>>>>>>>>
>>>>>>>> In the paper, there's a paragraph on reliability that discusses some
>>>>>>>> of the issues at a high level.  It strikes me as both interesting and
>>>>>>>> challenging because you have both system components (i.e. Cobalt) and
>>>>>>>> multiple user-space components (Falkon, Swift, application tasks)
>>>>>>>> interacting.
>>>>>>>>
>>>>>>>> It might also be worth looking at this environment to help understand
>>>>>>>> the use cases and requirements for the policy/control channels (as
>>>>>>>> opposed to the FTB's informational channel).
>>>>>>>>
>>>>>>>> Just some ideas, db
>>>>>>>> -- 
>>>>>>>> David E. Bernholdt                   |   Email: bernholdtde at ornl.gov
>>>>>>>> Oak Ridge National Laboratory        |   Phone: +1 (865) 574 3147
>>>>>>>> http://www.csm.ornl.gov/~bernhold/   |   Fax:   +1 (865) 576 5491
>>>>>>>>
>>>>>>>> --~--~---------~--~----~------------~-------~--~----~
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "CIFTS" group.
>>>>>>>> To post to this group, send email to cifts at googlegroups.com
>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>> cifts+unsubscribe at googlegroups.com
>>>>>>>> For more options, visit this group at
>>>>>>>> http://groups.google.com/group/cifts?hl=en
>>>>>>>> -~----------~----~----~----~------~----~------~--~---
>>>>>>>> -- 
>>>>>>>> ===================================================
>>>>>>>> Ioan Raicu, Ph.D.
>>>>>>>> ===================================================
>>>>>>>> Distributed Systems Laboratory
>>>>>>>> Computer Science Department
>>>>>>>> University of Chicago
>>>>>>>> 1100 E. 58th Street, Ryerson Hall
>>>>>>>> Chicago, IL 60637
>>>>>>>> ===================================================
>>>>>>>> Email: iraicu at cs.uchicago.edu
>>>>>>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>>>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>>>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>>>>>>> ===================================================
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


