Topics in Data-Intensive Computing Systems

 

Instructor: Matei Ripeanu

Schedule: Tue/Thu, 4:30-6:00

Location:

 

Mailing list:

 

Course description

 

Modern science is often data-intensive: large-scale simulations, new scientific instruments, and deployed sensor networks all generate impressive volumes of data that often need to be analyzed by large, geographically dispersed user communities in fields as diverse as genomics and high-energy physics.

This graduate course will cover fundamentals of data management in distributed systems, large‑scale data storage systems, and their interaction with data‑intensive computing systems.  Advances in all these directions are the foundation of recent efforts to build cyber-infrastructure.

The course will explore solutions that provide fast access to data and improve data availability and durability under various consistency, scale, and component-failure constraints.  Students will be exposed to a range of distributed data storage techniques, from traditional (distributed) file systems, to cooperative Internet proxy caches, to peer-to-peer file sharing, to virtual data concepts and their integration with massive computing systems.

 

Course structure

Three hours of classes per week, with time divided roughly equally between traditional lectures and student presentations/group discussions of recent research results.

 

Course outline (tentative weekly topics)

  1. Introduction. Overview of current research problems, technologies, and applications.
  2. File system semantics, data durability and availability, replication and consistency, fault-tolerance.
  3. Data storage technologies. Storage hierarchies. Capacity management.
  4. Applications, data access patterns, workload characterization. Scientific data workloads.
  5. Integration with compute systems. Grids and virtual data.
  6. Performance focus: caching, parallel access, striping.
  7. Structured overlays. Distributed hash tables.  File systems harnessing structured overlays.
  8. Security.
  9. Applications I: Experience with deployed systems. (NFS, AFS, Google File System)
  10. Applications II: Data archival. Cooperative internet proxy caches. Content distribution networks.
  11. Applications III: Peer-to-peer file-sharing (BitTorrent, FreeLoader)
  12. Project presentations

 

Team project

Each team (2-3 members) examines a particular distributed-systems topic, focusing on data-related issues.  While a set of projects will be proposed, students are encouraged to define a project of their own: characterize an existing system, propose and evaluate techniques to improve existing systems, or prototype a new system.  Students must justify why a particular approach was chosen and what it contributes, grounding their reasoning in scientific or engineering knowledge drawn from the literature search.  The project is evaluated through both a written report, in the standard format of IEEE publications, and an oral presentation.

 

References

Books (recommended):

  1. The Grid: Blueprint for a New Computing Infrastructure, Ian Foster, Carl Kesselman editors, 2nd Edition, Morgan Kaufmann, 2004.
  2. Reliable Distributed Systems: Technologies, Web Services, and Applications, Kenneth Birman, Springer, 2005.

 

Journals:

  1. ACM Transactions on Storage
  2. IEEE Transactions on Parallel and Distributed Systems
  3. IEEE/ACM Transactions on Networking

 

Conferences:

  1. USENIX Conference on File and Storage Technologies (FAST)
  2. USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI)
  3. USENIX Symposium on Operating Systems Design and Implementation (OSDI)
  4. IEEE International Symposium on High Performance Distributed Computing (HPDC)
  5. IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (Supercomputing – SC)

 

Grading

Research paper reviews, class participation: 50%

Project report and presentation: 50%