[Swift-devel] Re: Fault tolerance in "many task computing"?

Michael Wilde wilde at mcs.anl.gov
Mon Mar 2 17:26:49 CST 2009


All,

Pete suggested we take a look at CIFTS's message logging system and 
consider integrating it into our stack. Rinku gave me, Allan, and Zhao 
an excellent overview and demo of the system. (Thanks, Rinku!)

Here are my notes from this meeting. My intent is just to start a 
discussion for longer-term consideration, not any near-term action.
(Although Jing Tie may find some of these concepts fruitful for her 
troubleshooting research).

CIFTS is the DOE SciDAC project "Coordinated and Improved Fault 
Tolerance for High Performance Computing Systems", PI'd by Pete:
http://www.mcs.anl.gov/research/cifts/index.php

It produces "FTB", a backplane for distributing logging information 
within a distributed system:

http://www.mcs.anl.gov/research/cifts/docs/files/ftb_developers_guide.pdf

I pointed Rinku to Swift and Falkon info, as well as Netlogger and 
activities related to it in the CEDPS project, and we have a joint 
action item to understand the possible overlap and integration issues 
and possibilities between these two systems.

Netlogger and CEDPS info is at:

http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page
http://dev.globus.org/wiki/Incubator/NetLogger
http://www.cedps.net/index.php/Troubleshooting#Work-in-progress

I mentioned that we have invested a small bit of effort in integrating 
Netlogger log publishing capabilities into Swift.

Potential overlap notwithstanding, CIFTS (and in particular the Fault 
Tolerant Backplane, FTB) could serve as a very nice consolidation 
service for log information originating in the many different components 
involved in executing a Swift program:

- the application program wrapper script
- the Falkon or Coaster worker agent
- the Globus job manager and/or local scheduler
- the worker node
- the remote site fileserver/filesystem
- a site system management facility like BG/P's RAS service
- Falkon and Coaster servers and bootstrappers
- the Swift client-side engine
- GridFTP and other transport protocols and services
- etc.

FTB would enable us to readily capture and consolidate all these 
information sources and funnel the data into streams related to specific 
Swift program executions. It has the infrastructure to route messages 
out of distributed systems, and to permit publication of and 
subscription to message streams. Its agents, it seems, can help messages 
traverse firewalls and deal with other transport and delivery issues.
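
To make the publish/subscribe idea concrete, here is a rough Python sketch 
of per-run consolidation. It is not FTB's API (FTB is a C library, and I 
have not mapped this onto its calls); Backplane, Event, run_id, and the 
source names are all hypothetical, made up only to illustrate routing 
messages from many components into one stream per Swift execution.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class Event:
    run_id: str    # hypothetical ID for one Swift program execution
    source: str    # hypothetical component name, e.g. "wrapper", "gridftp"
    severity: str  # e.g. "info" or "error"
    message: str


class Backplane:
    """Toy in-process stand-in for a message backplane (NOT FTB's API)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Event], None]]] = (
            defaultdict(list)
        )

    def subscribe(self, run_id: str, callback: Callable[[Event], None]) -> None:
        # Register interest in all events tagged with this run.
        self._subscribers[run_id].append(callback)

    def publish(self, event: Event) -> None:
        # Deliver only to subscribers of the event's run; others never see it.
        for callback in self._subscribers.get(event.run_id, []):
            callback(event)


def consolidate(bus: Backplane, run_id: str) -> List[Event]:
    """Return a list that fills with every event published for run_id."""
    stream: List[Event] = []
    bus.subscribe(run_id, stream.append)
    return stream


if __name__ == "__main__":
    bus = Backplane()
    stream = consolidate(bus, "run-1")
    bus.publish(Event("run-1", "wrapper", "info", "job started"))
    bus.publish(Event("run-2", "gridftp", "error", "transfer failed"))
    bus.publish(Event("run-1", "coaster-worker", "error", "worker lost"))
    for event in stream:
        print(event.source, event.severity, event.message)
```

In a real deployment the publishers would sit in separate processes on 
separate machines, which is exactly the part that the FTB agents would 
have to provide; the sketch only shows the tagging and filtering.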

FTB is implemented as a C API, and comes with a set of example clients. 
From this a simple set of command line interfaces could be derived to 
permit low-cost experimentation with the system in, e.g., Falkon on the 
BG/P, where Rinku and others are implementing collectors to gather log 
information from different parts of ZeptoOS and the BG/P hardware complex.

It's not clear that any of us have the cycles within the next two months 
to explore this, but it would make an interesting student project, to 
compare CIFTS and NetLogger, and to test some initial integrations into 
Swift, Falkon, and Coasters. (I feel it's a good Summer of Code project).

My initial question is whether some CIFTS/FTB hooks could be planted in 
a lightweight Swift experiment, and we could try to get a feel for 
whether the infrastructure gives us something that we can't readily get 
today.  My gut feel is that it does.

I think it would be a great research/development topic to explore how 
close this could bring us to the point where all distributed errors are 
cleanly routed back to the centralized user to more quickly pinpoint the 
cause of remote and distributed failures.  Swift does a *pretty* good 
job of this today, albeit in a somewhat ad-hoc fashion. FTB would make 
it easier to integrate information from additional sources like the 
remote scheduler and BG/P RAS logs into the debugging process.

And all that is before we even consider the goals of automating fault 
tolerance, which I think is the ultimate vision of CIFTS.

Thoughts and discussion welcome. Once any of us gets a day or so to play 
with FTB, we'll know more about the possibilities.

Regards,

Mike


On 3/1/09 11:11 AM, Ioan Raicu wrote:
> Hi Rinku,
> It looks like I am not going to be able to make the meeting tomorrow. On 
> Friday, another interview opportunity came up, and the only open slot 
> for the next 2 weeks was this Monday. Sorry about the short notice. Go 
> ahead and meet without me, and I'll catch up with what was discussed at 
> the meeting from Mike.
> 
> Thanks,
> Ioan
> 
> Michael Wilde wrote:
>> Zhao, Allan, you're welcome to join this discussion, at Argonne Mar 2, 
>> or by phone.
>>
>> - Mike
>>
>>
>> On 2/18/09 10:30 PM, Ian Foster wrote:
>>> Hi,
>>>
>>> This sounds like a really fun project. Maybe we should involve Zhao 
>>> and Allan as well, given that Ioan has (sadly) graduated, and will 
>>> leave us?
>>> I'd love to participate, I will need to do so by phone--could we do 
>>> that? I'll just listen in, and see what I can learn.
>>>
>>> Ian.
>>>
>>>
>>> On Feb 18, 2009, at 8:46 PM, Ioan Raicu wrote:
>>>
>>>> Great!
>>>>
>>>> I added Ian as a cc, maybe he wants to come to this meeting as well. 
>>>> Ian, the original message from Pete was:
>>>>> Ioan and Mike,
>>>>>
>>>>> The CIFTS project is a DOE project to provide a "fault tolerant 
>>>>> backplane".  I'm the PI of the project which involved ORNL, LBL, 
>>>>> IU, Ohio State, and UTK.  Below is a suggestion to hook CIFTS to 
>>>>> Falkon, so faults could be monitored.  Rinku (on the cc: line) is 
>>>>> the lead developer for CIFTS.  Maybe when one of you is on campus 
>>>>> (ANL) you can meet with Rinku, and brainstorm if there is any way 
>>>>> to link the two systems efficiently.  Email below is from an ORNL 
>>>>> participant in the CIFTS framework.
>>>>>
>>>>> -Pete 
>>>> The meeting is scheduled with Rinku, Mike, Pete (?), and me for March 
>>>> 2nd, at 11AM in Rinku's office (ANL, D-231 in the MCS building).
>>>>
>>>> Ioan
>>>>
>>>> Rinku Gupta wrote:
>>>>> We can meet at my office (D-231 in the MCS building) and then sneak 
>>>>> into Pete's room, if it is empty.
>>>>>
>>>>> Rinku
>>>>>
>>>>>
>>>>>
>>>>> ----- "Ioan Raicu" <iraicu at cs.uchicago.edu> wrote:
>>>>>
>>>>>  
>>>>>> Works for me! I assume we are meeting at ANL. Whose office are we
>>>>>> meeting in?
>>>>>>
>>>>>> Ioan
>>>>>>
>>>>>> Rinku Gupta wrote:
>>>>>>
>>>>>> Based on everyone's availability, how does 11:00am on March 2nd sound?
>>>>>>
>>>>>> Thanks
>>>>>> Rinku
>>>>>>
>>>>>>
>>>>>> ----- "Michael Wilde" <wilde at mcs.anl.gov> wrote:
>>>>>>
>>>>>> Rinku, Ioan,
>>>>>>
>>>>>> I can do Thu Feb 26 10:30 (I assume you meant AM not PM).
>>>>>>
>>>>>> But if Rinku is just arriving back in the US that morning, it seems
>>>>>> better to postpone to the week after.
>>>>>>
>>>>>> I can be at Argonne any time week of March 2. Mornings are free,
>>>>>> Mon-Thu are best.
>>>>>>
>>>>>> Can we tentatively then meet at 11AM Mon Mar 2?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>>
>>>>>> On 2/18/09 9:37 AM, Ioan Raicu wrote:
>>>>>>
>>>>>> Hi Rinku,
>>>>>> Next Thursday (February 26th) at 10:30AM would work for me. If we
>>>>>> need to meet the following week, I could meet Monday (March 2nd) and
>>>>>> Thursday (March 5th) any time.
>>>>>>
>>>>>> Cheers,
>>>>>> Ioan
>>>>>>
>>>>>> Rinku Gupta wrote:
>>>>>>
>>>>>> Hi Michael,  Ioan
>>>>>>
>>>>>> I am currently on travel and will arrive back to the USA only
>>>>>> Thursday (Feb 26th) early morning. Will you be available anytime the
>>>>>> week after next? If not, then we can try to schedule a meeting
>>>>>> sometime around 10:30/11pm next Thursday at ANL.
>>>>>>
>>>>>> Thanks
>>>>>> Rinku
>>>>>>
>>>>>>
>>>>>> ----- "Ioan Raicu" <iraicu at cs.uchicago.edu> wrote:
>>>>>>
>>>>>> Hi Rinku,
>>>>>> I can meet next week on Wednesday any time, and Thursday morning
>>>>>> before noon, as I have a flight to catch early afternoon from O'Hare.
>>>>>> I can meet either at UC or ANL. Let me know what works best for
>>>>>> everyone.
>>>>>>
>>>>>> Thanks,
>>>>>> Ioan
>>>>>>
>>>>>> Michael Wilde wrote:
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Rinku, let's set up a meeting for next week to discuss. I can meet
>>>>>> Wed or Thu, at Argonne or UChicago.
>>>>>>
>>>>>> Do either of those dates work for you, and which place is best?
>>>>>>
>>>>>> In the meantime I'll read up on CIFTS at
>>>>>> http://www.mcs.anl.gov/research/cifts/docs/index.php and the wiki
>>>>>> that this refers to.
>>>>>>
>>>>>> If you have any other docs we should read, please send them.
>>>>>>
>>>>>> Thanks and regards,
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>>
>>>>>> On 2/17/09 12:42 PM, Pete Beckman wrote:
>>>>>>
>>>>>> Ioan and Mike,
>>>>>>
>>>>>> The CIFTS project is a DOE project to provide a "fault tolerant
>>>>>> backplane".  I'm the PI of the project which involved ORNL, LBL, IU,
>>>>>> Ohio State, and UTK.  Below is a suggestion to hook CIFTS to Falkon,
>>>>>> so faults could be monitored.  Rinku (on the cc: line) is the lead
>>>>>> developer for CIFTS.  Maybe when one of you is on campus (ANL) you
>>>>>> can meet with Rinku, and brainstorm if there is any way to link the
>>>>>> two systems efficiently.  Email below is from an ORNL participant in
>>>>>> the CIFTS framework.
>>>>>>
>>>>>> -Pete
>>>>>>
>>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>> From: bernholdtde at ornl.gov
>>>>>> Date: February 12, 2009 10:29:47 AM CST
>>>>>> To: cifts at googlegroups.com
>>>>>> Cc: bernholdtde at ornl.gov
>>>>>> Subject: Fault tolerance in "many task computing"?
>>>>>> Reply-To: cifts at googlegroups.com
>>>>>>
>>>>>> Pete (and other ANL folks),
>>>>>>
>>>>>> I recently read the SC08 paper on many task computing on which you're
>>>>>> a co-author. ( http://portal.acm.org/citation.cfm?doid=1413370.1413393 )
>>>>>>
>>>>>> I wonder if it would be viable to build a CIFTS demonstration scenario
>>>>>> around the software system described in this paper?
>>>>>>
>>>>>> In the paper, there's a paragraph discussing reliability that
>>>>>> discusses some of the issues at a high level.  It strikes me as both
>>>>>> interesting and challenging because you have both system components
>>>>>> (i.e. Cobalt) and multiple user-space components (Falkon, Swift,
>>>>>> application tasks) interacting.
>>>>>>
>>>>>> It might also be worth looking at this environment to help understand
>>>>>> the use cases and requirements for the policy/control channels (as
>>>>>> opposed to the FTB's informational channel).
>>>>>>
>>>>>> Just some ideas, db
>>>>>> -- 
>>>>>> David E. Bernholdt                   |   Email: bernholdtde at ornl.gov
>>>>>> Oak Ridge National Laboratory        |   Phone: +1 (865) 574 3147
>>>>>> http://www.csm.ornl.gov/~bernhold/   |   Fax:   +1 (865) 576 5491
>>>>>>
>>>>>> ===================================================
>>>>>> Ioan Raicu, Ph.D.
>>>>>> ===================================================
>>>>>> Distributed Systems Laboratory
>>>>>> Computer Science Department
>>>>>> University of Chicago
>>>>>> 1100 E. 58th Street, Ryerson Hall
>>>>>> Chicago, IL 60637
>>>>>> ===================================================
>>>>>> Email: iraicu at cs.uchicago.edu
>>>>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>>>>> ===================================================
>>>>>> ===================================================
>>>>>>     
>>>>>   
>>>>
>>>> -- 
>>>> ===================================================
>>>> Ioan Raicu, Ph.D.
>>>> ===================================================
>>>> Distributed Systems Laboratory
>>>> Computer Science Department
>>>> University of Chicago
>>>> 1100 E. 58th Street, Ryerson Hall
>>>> Chicago, IL 60637
>>>> ===================================================
>>>> Email: iraicu at cs.uchicago.edu
>>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>>> ===================================================
>>>> ===================================================
>>>
>>
> 


