Topics in Distributed Systems – Autonomic Computing

 

Instructor: Matei Ripeanu 

TA: Samer Al-Kiswany

Schedule: Monday 5:00-8:00

Location: KAIS4018 (might change)

 

Announcements:

[14/01] Papers for next week selected. The deadline for review submission is Sunday 11:59pm.

[06/01] Subscribe to the class mailing list by emailing “sympa@ece.ubc.ca” with “subscribe eece571R@ece.ubc.ca” in the body of the message.

[06/01] Past project reports are here.  Some project ideas here.

 

Course description

As systems become more interconnected and diverse, architects and system administrators find it increasingly difficult to design interactions among components and anticipate their consequences. "The growing complexity of the IT infrastructure threatens to undermine the very benefits information technology aims to provide" [IBM’s Autonomic Computing Manifesto].

 

This course will focus on new ideas and developments that address one of the greatest challenges faced by the computing community: the design of complex, large-scale systems whose operation requires little to no human intervention other than to add/replace hardware or to change the system's service contract. 

 

The course will cover fundamental issues related to architecting and building self-managing, self-repairing, self-organizing, and/or self-configuring – in a word, autonomous – systems. Students will be exposed to the main architectural approaches to designing systems with self-* properties (emergence-, control-theory-, and component-based approaches); to issues related to operating-system and middleware-level support for autonomic properties; to fault-detection, diagnosis, and control techniques for large-scale systems; and to the impact of emerging computing trends (e.g., cloud computing) on large-scale data processing system design. Advances in all these directions are key ingredients for recent efforts to build cyberinfrastructure.

Course format

The course is structured to provide (i) an in-depth understanding of current topics in large-scale, distributed system research; (ii) experience with reviewing and presenting advanced technical material; and (iii) practice in writing and critically reviewing research papers. The class workload has a participation component and a final project.

Participation. In each class we discuss two or more research papers. Read the papers before class (be an efficient reader!)  and write a review for each paper that includes the following:

1. State the main contribution of the paper.

2. Critique the main contribution. 

a. Rate the significance of the paper on a scale from 5 to 1: 5 (breakthrough), 4 (significant contribution), 3 (modest contribution), 2 (incremental contribution), 1 (no contribution or negative contribution). More importantly, explain your rating in a sentence or two.

b. Rate how convincing the methodology is. You may consider some of the following questions (use what is relevant): Do the claims and conclusions follow from the experiments? Are the assumptions realistic? Are the experiments well designed? Are there different experiments that would be more convincing? Are there other alternatives the authors should have considered? (And, of course, is the paper free of methodological errors?)

c. What are the most important limitations of the approach?

3. What are the two strongest and/or most interesting ideas in the paper?

4. What are the two most striking weaknesses in the paper?

5. Name two questions that you would like to ask the authors.

6. Detail an interesting extension to the work not mentioned in the future work section.

7. Optional comments on the paper that you’d like to see discussed in class.

Reviews must be submitted by midnight the day before the class to the relevant Rotisserie Discussion on H2O. Papers are discussed in class. Discussions will be led by one or more students and may include a brief (10-minute) presentation of the paper. Discussion leaders do not need to submit reviews, but they need to: (a) prepare a discussion plan, and (b) post a brief discussion summary on H2O based on the in-class discussion (due before the following class).

Project: The final project is an opportunity for hands-on research in distributed systems. It involves a literature survey, programming, running experiments or analytical modeling, analyzing results, and writing a 10-page report. A list of project ideas is posted, but students are highly encouraged to propose topics of their own interest. Teams of two students are highly recommended. Please see me if you want to form a larger team.

 

Schedule (tentative):

Previous years’ course schedules can be found here: 2009-Massively Parallel/Distributed Computing Platforms; 2008-Quality of service; 2007-Data-intensive Computing Systems.

 

 

 

Topic / Project steps (tentative)

Papers / Other links (tentative)

W1

01/04

Introduction. Autonomic computing vision. Overview of current research problems.  Course mechanics.

 

[Project: guidelines, requirements, defining success]

 

Slides: [s00, s01]

Suggested readings

         J. Kephart and D. Chess, The Vision of Autonomic Computing, IEEE Computer 36(1): 41-50 (2003).

         A. Ganek and T. Corbi, The Dawning of the autonomic computing era, IBM Systems Journal, 42(1):5-18, 2003.

         D. Clark, C. Partridge, J. Ramming, J. Wroclawski, A Knowledge Plane for the Internet, SIGCOMM, 2003.

         A. Brown, D. A. Patterson, Embracing Failure: A Case for Recovery-Oriented Computing, HPTS 2001.

         R. Want, T. Pering, D. Tennenhouse, Comparing autonomic and proactive computing, IBM Systems Journal, 42(1):129-135, 2003.

W2

01/11

The administrator perspective. 

 

Slides: [s02]

Required

         Undo for operators: Building an undoable e-mail store, A. Brown and D. A. Patterson, USENIX  2003.

         K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Understanding and Dealing with Operator Mistakes in Internet Services, OSDI 2004.

Optional

         R. Barrett, P. Maglio, E. Kandogan, J. Bailey, Usable Autonomic Computing Systems: the Administrators' Perspective, ICAC’04

         A. Brown and J. Hellerstein, Reducing the Cost of IT Operations - Is Automation Always the Answer?, HotOS 2005.

         Ya-Yunn Su, Mona Attariyan, and Jason Flinn, AutoBash: Improving Configuration Management with Operating System Causality Analysis, SOSP’07.

         Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang, Automatic Misconfiguration Troubleshooting with PeerPressure, OSDI ’04

         D. Oppenheimer, A. Ganapathi, D. Patterson, Why do Internet services fail, and what can be done about it?, USITS’03

 

W3

01/18

Automatic performance diagnosis and tuning

(Steven)

[Project: 5-min project idea presentation, discussion of project themes.]

 

Slides: [s03]

Required

         M. Chen, A. Accardi, E. Kiciman, D. Patterson, A. Fox, E. Brewer, Path-Based Failure and Evolution Management, NSDI’04

Optional

         P. Barham, A. Donnelly, R. Isaacs, R. Mortier, Using Magpie for request extraction and workload modeling, OSDI’04

         M. Aguilera, J. Mogul, J. Wiener, P. Reynolds, A. Muthitacharoen, Performance debugging for distributed systems of black boxes, SOSP’03.

 

W4

01/25

Automatic performance diagnosis and tuning 

(Steven & Lauro)

 

Slides: [s04]

Required

         B. Dageville, M. Zaït: SQL Memory Management in Oracle9i. VLDB 2002: 962-973.

         A.J. Storm, C. Garcia-Arellano, S. Lightstone, Y. Diao, M. Surendra: Adaptive Self-tuning Memory in DB2, VLDB 2006

Optional

         S. Chaudhuri, V. Narasayya, Self-tuning database systems: a decade of progress, VLDB’07

         K. Dias, M. Ramacher, U. Shaft, V. Venkataramani, G. Wood, Automatic Performance Diagnosis and Tuning in Oracle, CIDR’05

         S. Chaudhuri, V. Narasayya, An Efficient, Cost-Driven Index Selection Tool for Microsoft SQL Server, VLDB’97

         S. Agrawal, S. Chaudhuri, V. Narasayya, Automated Selection of Materialized Views and Indexes in SQL Databases, VLDB’00

 

W5

02/01

[Project: submit a two-page proposal by Sunday 01/31]

No papers to read – project discussion with each group.

W6

02/08

Models, approaches and architectures: control-theory-based approaches

(Lauro & Keven)

Required

         T. Abdelzaher, K. Shin, N. Bhatti, Performance Guarantees for Web Server End-Systems: A Control-Theoretical Approach, IEEE TPDS, 13(1), Jan. 2002. [link]

         A. Arpaci-Dusseau et al, Information and Control in Gray-Box Systems, SOSP’01 [link]

Optional

         T. Abdelzaher, J. Stankovic, C. Lu, R. Zhang, Y. Lu, Feedback Performance Control in Software Services, IEEE Control Systems Magazine, 23(3), 2003. [link]

         T. Horvath, T. Abdelzaher, K. Skadron, X. Liu, Dynamic Voltage Scaling in Multi-tier Web Servers with End-to-end Delay Control, IEEE Transactions on Computers,  56(4),  2007 [link]

 

         Midterm break

W7

03/01

[Project: midterm presentations; submit a midterm report of up to five pages, including related work. Deadline: Saturday 02/27]

 

W8

03/08

Models, approaches and architectures: emergence-based approaches

 

(Keven)

 

Slides: [s05]

Required

         F. Dabek et al., Designing a DHT for low latency and high throughput, NSDI’04.

Optional

         O. Babaoglu, M. Jelasity, A. Montresor, Grassroots Approach to Self-management in Large-Scale Distributed Systems, LNCS 3566, pp. 286-296, 2005.

         J. Li, J. Stribling, R. Morris, F. Kaashoek, Bandwidth-efficient Management of DHT Routing, NSDI 2005.

         Ion Stoica et al., Chord: a scalable peer-to-peer lookup protocol for internet applications, IEEE/ACM Trans. Netw. 11(1): 17-32 (2003).

         K. Gummadi, R. Gummadi, S. D. Gribble, S. Ratnasamy, S. Shenker, I. Stoica, The impact of DHT routing geometry on resilience and proximity, SIGCOMM’03.

 

W9

03/15

System-level issues: cloud applications / storage systems

 

(Annie & MohammedS)

Required

         Peter Bodik, Rean Griffith, Charles Sutton, Armando Fox, Michael Jordan, David Patterson,  Statistical Machine Learning Makes Automatic Control Practical for Internet Datacenters, HotCloud’09

         Peter Bodík, Rean Griffith, Charles Sutton, Armando Fox, Michael I. Jordan, David A. Patterson, Automatic Exploration of Datacenter Performance Regimes, ACDC’09.

         H. Lim, S. Babu, J. Chase, S. Parekh, Automated Control in Cloud Computing: Challenges and Opportunities, ACDC’09.

         Michael Abd-El-Malek et al., Ursa Minor: versatile cluster-based storage, FAST’05

Optional

         ACDC’09 and HotCloud’09 workshops;

         G. Alvarez, E. Borowsky, S. Go, T. Romer, R. Becker-Szendy, R. Golding, A. Merchant, M. Spasojevic, A. Veitch, J. Wilkes, MINERVA: An Automated Resource Provisioning Tool for Large-Scale Storage Systems, ACM TOCS, 19(4):483-518, Nov. 2001.

         Y. Saito, S. Frølund, A. Veitch, A. Merchant, S. Spence: FAB: building distributed enterprise disk arrays from commodity components. ASPLOS 2004

         S. Al-Kiswany, A. Gharaibeh, M. Ripeanu, The Case for a Versatile Storage System, HotStorage’09.

 

W10

03/22

No class – travel

 

W11

03/29

System-level issues: storage systems / operating systems

 (Annie)

Required

         Mesnier et al, File Classification in Self-* Storage Systems, ICAC’04.

Optional

         Self-* storage project [link]

 

W12

04/05

UBC closed

 

W13

04/12

Protection and security issues

 (MohammadA)

Required

         M. Costa, J. Crowcroft, M. Castro, A. Rowstron, L. Zhou, L. Zhang, P. Barham, Vigilante: End-to-end Containment of Internet Worms, SOSP, 2005.

         Yi-Min Wang, Ming Ma, Strider Search Ranger: Towards an Autonomic Anti-Spam Search Engine, ICAC’07

Optional

         D. Chess, C. Palmer, S. White, Security in an autonomic computing environment, IBM Systems Journal, 2003.

 

W14

04/19

Nature-inspired systems / Real-world experience

 

(Emalayan)

Required

         F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou, Rx: Treating Bugs as Allergies - A safe method to survive software failures, In Proc. SOSP, 2005.

         P. Snyder, R. Greenstadt, G. Valetto. Myconet: A Fungi-inspired Model for Superpeer-based Peer-to-Peer Overlay Topologies. SASO’09.

Optional

         ACMS: The Akamai Configuration Management System, NSDI’05.

         I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, J. Chase, Correlating instrumentation data to system states: A building block for automated diagnosis and control, OSDI, 2004.

         P. Reynolds, C. Killian, J. Wiener, J. Mogul, M. Shah, A. Vahdat, Pip: Detecting the Unexpected in Distributed Systems, NSDI 2006.

         S. Hofmeyr and S. Forrest, Architecture for an Artificial Immune System, Evolutionary Computation 7, No. 1, 1289-1296 2000.

         D. Knoester, P. McKinley. Evolution of Probabilistic Consensus in Digital Organisms, SASO’09

W15

04/26

Course summary and wrap-up.

 

[Project: presentations and wrap-up]

 

 

 

Other links:

1.     Autonomic Computing @ IBM: http://www.research.ibm.com/autonomic/

2.     Recovery-oriented Computing: http://roc.cs.berkeley.edu/

3.     ICAC: 09, 08, 07, 06, 05, 04

4.     SASO: 09, 08, 07