Topics in Distributed Systems – Autonomic Computing
Instructor: Matei Ripeanu
TA: Samer Al-Kiswany
Schedule: Monday 5:00-8:00
Location: KAIS4018 (might change)
Announcements:
[14/01] Papers for next week selected. The deadline for review
submission is Sunday 11:59pm.
[06/01] Subscribe to the class
mailing list by emailing “
[06/01] Past project reports
are here.
Some project ideas here.
As systems become more interconnected and diverse,
architects and system administrators find it increasingly difficult to design
interactions among components and anticipate their consequences. "The growing
complexity of the IT infrastructure threatens to undermine the very benefits
information technology aims to provide" [IBM’s Autonomic Computing
Manifesto].
This course will focus on new ideas and developments that
address one of the greatest challenges faced by the computing community: the
design of complex, large-scale systems whose operation requires little to no
human intervention other than to add/replace hardware or to change the system's
service contract.
The course will cover fundamental issues related to
architecting and building self-managing, self-repairing, self-organizing,
and/or self-configuring – in a word, autonomous – systems. Students will be
exposed to the main architectural approaches to designing systems with self-*
properties (emergence-, control-theory-, and component-based approaches); to
issues related to operating-system and middleware-level support for autonomic
properties; to fault-detection, diagnosis, and control techniques for
large-scale systems; and to the impact of emerging computing trends
(e.g., cloud computing) on large-scale data processing system design.
Advances in all these directions are key ingredients for recent efforts to
build cyber infrastructure.
The course is structured to provide (i)
an in-depth understanding of current topics in large-scale, distributed system
research; (ii) experience with reviewing and presenting advanced technical
material; (iii) practice in writing and critically reviewing research papers.
The class workload has a participation
component and a final project.
1. State the main contribution of the paper.
2. Critique the main contribution.
   a. Rate the significance of the paper on a scale of 5 (breakthrough), 4 (significant contribution), 3 (modest contribution), 2 (incremental contribution), 1 (no contribution or negative contribution). More importantly: Explain your rating in a sentence or two.
   b. Rate how convincing the methodology is. You may consider some of the following questions (use what is relevant): Do the claims and conclusions follow from the experiments? Are the assumptions realistic? Are the experiments well designed? Are there different experiments that would be more convincing? Are there other alternatives the authors should have considered? (And, of course, is the paper free of methodological errors?)
   c. What are the most important limitations of the approach?
3. What are the two strongest and/or most interesting ideas in the paper?
4. What are the two most striking weaknesses in the paper?
5. Name two questions that you would like to ask the authors.
6. Detail an interesting extension to the work not mentioned in the future work section.
7. Optional comments on the paper that you’d like to see discussed in class.
Reviews must be submitted by midnight the day before the class to the
relevant Rotisserie Discussion on H2O. Papers are discussed in class.
Discussions will be led by one or more students and may include a brief
(10-minute) presentation of the paper. Discussion leaders do not need to
submit reviews, but they need to: (a) prepare a discussion plan, and (b) post
a brief discussion summary on H2O based on in-class discussions (due before
the following class).
Schedule (tentative):
Previous years’ course schedules can be found here:
2009 - Massively Parallel/Distributed Computing Platforms;
2008 - Quality of service;
2007 - Data-intensive Computing Systems.
Week |
Date |
Topic / Project
steps (tentative) |
Papers / Other
links (tentative) |
W1 |
01/04 |
Introduction. Autonomic computing vision. Overview of
current research problems. Course
mechanics. [Project: guidelines, requirements,
defining success] |
Suggested readings
• J. Kephart and D. Chess, The Vision of Autonomic Computing, IEEE Computer 36(1): 41-50 (2003).
• A. Ganek and T. Corbi, The Dawning of the autonomic computing era, IBM Systems Journal, 42(1):5-18, 2003.
• D. Clark, C. Partridge, J. Ramming, J. Wroclawski, A Knowledge Plane for the Internet, SIGCOMM, 2003.
• A. Brown, D. A. Patterson, Embracing Failure: A Case for Recovery-Oriented Computing, HPTS 2001.
• R. Want, T. Pering, D. Tennenhouse, Comparing autonomic and proactive computing, IBM Systems Journal, 42(1):129-135, 2003. |
W2 |
01/11 |
The administrator
perspective. Slides: [s02] |
Required
• A. Brown and D. A. Patterson, Undo for operators: Building an undoable e-mail store, USENIX 2003.
• K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Understanding and Dealing with Operator Mistakes in Internet Services, OSDI 2004.
Optional
• R. Barrett, P. Maglio, E. Kandogan, J. Bailey, Usable Autonomic Computing Systems: the Administrators' Perspective, ICAC’04
• A. Brown and J. Hellerstein, Reducing the Cost of IT Operations - Is Automation Always the Answer?, HotOS 2005.
• Ya-Yunn Su, Mona Attariyan, and Jason Flinn, AutoBash: Improving Configuration Management with Operating System Causality Analysis, SOSP’07.
• Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang, Automatic Misconfiguration Troubleshooting with PeerPressure, OSDI’04
• D. Oppenheimer, A. Ganapathi, D. Patterson, Why do Internet services fail, and what can be done about it?, USITS’03 |
W3 |
01/18 |
Automatic performance diagnosis and tuning (Steven) [Project: 5-min project idea presentation,
discussion of project themes.] Slides: [s03] |
Required
• M. Chen, A. Accardi, E. Kiciman, D. Patterson, A. Fox, E. Brewer, Path-Based Failure and Evolution Management, NSDI’04
Optional
• P. Barham, A. Donnelly, R. Isaacs, R. Mortier, Using Magpie for request extraction and workload modeling, OSDI’04
• M. Aguilera, J. Mogul, J. Wiener, P. Reynolds, A. Muthitacharoen, Performance debugging for distributed systems of black boxes. |
W4 |
01/25 |
Automatic performance diagnosis and tuning (Steven & Lauro) Slides: [s04] |
Required
• B. Dageville, M. Zaït: SQL Memory Management in Oracle9i. VLDB 2002: 962-973.
• A.J. Storm, C. Garcia-Arellano, S. Lightstone, Y. Diao, M. Surendra: Adaptive Self-tuning Memory in DB2, VLDB 2006
Optional
• S. Chaudhuri, V. Narasayya, Self-tuning database systems: a decade of progress, VLDB’07
• K. Dias, M. Ramacher, U. Shaft, V. Venkataramani, G. Wood, Automatic Performance Diagnosis and Tuning in Oracle, CIDR’05
• S. Agrawal, |
W5 |
02/01 |
[Project: submit a two-page proposal by
Sunday 01/31] |
No papers to read –
project discussion with each group. |
W6 |
02/08 |
Models, approaches and architectures: control-theory-based approaches (Lauro & Keven) |
Required
• Abdelzaher, K. Shin,
• A. Arpaci-Dusseau et al, Information and Control in Gray-Box Systems, SOSP’01 [link]
Optional
• T. Abdelzaher, J. Stankovic, C. Lu, R. Zhang, Y. Lu, Feedback Performance Control in Software Services, IEEE Control Systems Magazine, vol 23(3), 2003. [link]
• T. Horvath, T. Abdelzaher, K. Skadron, X. Liu, Dynamic Voltage Scaling in Multi-tier Web Servers with End-to-end Delay Control, IEEE Transactions on Computers, 56(4), 2007 [link] |
Midterm break |
|||
W7 |
03/01 |
[Project: midterm presentations; submit a midterm report of up to five
pages, including related work. Deadline: Saturday 02/27] |
|
W8 |
03/08 |
Models, approaches and architectures: emergence-based approaches (Keven) Slides: [s05] |
Required
• F. Dabek et al., Designing a DHT for low latency and high throughput, NSDI’04
Optional
• O. Babaoglu, M. Jelasity, A. Montresor, Grassroots Approach to Self-management in Large-Scale Distributed Systems, LNCS 3566, pp. 286-296, 2005.
• J. Li, J. Stribling, R. Morris, F. Kaashoek: Bandwidth-efficient Management of DHT Routing Tables, NSDI 2005
• Ion Stoica et al., Chord: a scalable peer-to-peer lookup protocol for internet applications, IEEE/ACM Trans. Netw. 11(1): 17-32 (2003)
• K. Gummadi, R. Gummadi, S. D. Gribble, S. Ratnasamy, S. Shenker, I. Stoica: The impact of DHT routing geometry on resilience and proximity, SIGCOMM’03. |
W9 |
03/15 |
System-level issues: Cloud applications / storage systems (Annie & MohammedS) |
Required
• Peter Bodík, Rean Griffith, Charles Sutton, Armando Fox, Michael I. Jordan, David A. Patterson, Statistical Machine Learning Makes Automatic Control Practical for Internet Datacenters, HotCloud’09
• Peter Bodík, Rean Griffith, Charles Sutton, Armando Fox, Michael I. Jordan, David A. Patterson, Automatic Exploration of Datacenter Performance Regimes, ACDC’09.
• H. Lim, S. Babu, J. Chase,
• Michael Abd-El-Malek et al., Ursa Minor: versatile cluster-based storage, FAST’05
Optional
• ACDC’09 and HotCloud’09 workshops;
• G. Alvarez, E. Borowsky, S. Go, T. Romer, R. Becker-Szendy, R. Golding, A. Merchant, M. Spasojevic, A. Veitch, J. Wilkes, MINERVA: An Automated Resource Provisioning Tool for Large-Scale Storage Systems, ACM TOCS, 19(4):483-518, Nov. 2001.
• Y. Saito, S. Frølund, A. Veitch, A. Merchant, S. Spence: FAB: building distributed enterprise disk arrays from commodity components. ASPLOS 2004
• S. Al-Kiswany, A. Gharaibeh, M. Ripeanu, The Case for a Versatile Storage System, HotStorage’09. |
W10 |
03/22 |
No class – travel |
|
W11 |
03/29 |
System-level issues: storage systems / operating systems (Annie) |
Required
• Mesnier et al, File Classification in Self-* Storage Systems, ICAC’04.
Optional
• Self-* storage project [link] |
W12 |
04/05 |
UBC closed |
|
W13 |
04/12 |
Protection and security issues (MohammadA) |
Required
• M. Costa, J. Crowcroft, M. Castro, A. Rowstron, L. Zhou, L. Zhang, P. Barham, Vigilante: End-to-end Containment of Internet Worms, SOSP, 2005.
• Yi-Min Wang, Ming Ma, Strider Search Ranger: Towards an Autonomic Anti-Spam Search Engine, ICAC’07
Optional
• D. Chess, C. Palmer, S. White, Security in an autonomic computing environment, IBM Systems Journal, 2003. |
W14 |
04/19 |
Nature-inspired systems / Real-world experience (Emalayan) |
Required
• F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou, Rx: Treating Bugs as Allergies - A safe method to survive software failures, In Proc. SOSP, 2005.
• P. Snyder, R. Greenstadt, G. Valetto. Myconet: A Fungi-inspired Model for Superpeer-based Peer-to-Peer Overlay Topologies. SASO’09.
Optional
• ACMS: The Akamai Configuration Management System, NSDI ’05
• I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, J. Chase, Correlating instrumentation data to system states: A building block for automated diagnosis and control, OSDI, 2004.
• P. Reynolds, C. Killian, J. Wiener, J. Mogul, M. Shah, A. Vahdat, Pip: Detecting the Unexpected in Distributed Systems, NSDI 2006.
• S. Hofmeyr and S. Forrest, Architecture for an Artificial Immune System, Evolutionary Computation 7, No. 1, 1289-1296, 2000.
• D. Knoester, P. McKinley. Evolution of Probabilistic Consensus in Digital Organisms, SASO’09 |
W15 |
04/26 |
Course summary and wrap-up. [Project: presentations and wrap-up] |
|
Other links:
1. Autonomic Computing @ IBM: http://www.research.ibm.com/autonomic/
2. Recovery-oriented Computing: http://roc.cs.berkeley.edu/