CS 6650 (2020 FA) - Reading list

Week 1. Intro

Sukumar Ghosh, Distributed Systems: An Algorithmic Approach, 2nd Edition (2014).
- Chapter 1
- Library link: https://onesearch.library.northeastern.edu/permalink/f/365rt0/NEU_ALMA51284197480001401

Additional resources

M. van Steen and A.S. Tanenbaum, Distributed Systems, 3rd ed., distributed-systems.net, 2017.
- Chapter 1
- Online link: https://www.distributed-systems.net/index.php/books/ds3/

Week 2. Concurrency

Processes and threads
- http://pages.cs.wisc.edu/~remzi/OSTEP/threads-intro.pdf
- Chapters 3.0, 3.1, and 3.3, Distributed Systems, 3rd edition, 2017, Maarten van Steen, Andrew S. Tanenbaum
C++ concurrency
- Chapters 1 - 5, C++ Concurrency in Action, 2nd Edition, Anthony Williams, Manning Publications
  - This book is accessible online through the library
  - This book is a good reference in general for C++ multi-threaded programming.
SEDA: an architecture for well-conditioned scalable Internet services, ACM Symposium on Operating Systems Principles, 2001, Matt Welsh (UC Berkeley, now at OctoML), David Culler (UC Berkeley), and Eric Brewer (UC Berkeley)
- http://www.sosp.org/2001/papers/welsh.pdf

Additional resources
- Why threads are a bad idea (for most purposes), USENIX Technical Conference, invited talk, 1995, John Ousterhout (Sun Microsystems Labs; now at Stanford)
  - https://web.stanford.edu/~ouster/cgi-bin/papers/threads.pdf
- Why events are a bad idea (for high-concurrency servers), USENIX Workshop on Hot Topics in Operating Systems, 2003, Rob von Behren (UC Berkeley, now at Google), Jeremy Condit (UC Berkeley, now at Google), and Eric Brewer (UC Berkeley)
  - https://www.usenix.org/legacy/events/hotos03/tech/full_papers/vonbehren/vonbehren.pdf

Week 3. Communications

Network in general
- Chapter 1, Computer Networks: A Systems Approach, 5th Edition, by Larry L. Peterson and Bruce S. Davie, Morgan Kaufmann, 2011 (available online through Northeastern library)
Socket programming
- Beej’s Guide to Network Programming, https://beej.us/guide/bgnet/
RPC
- Chapter 6.0-6.11, Kenneth P. Birman, Guide to Reliable Distributed Systems, 1st edition, Springer (available online through Northeastern library)
- Birrell and Nelson, Implementing remote procedure call, ACM Transactions on Computer Systems, Vol. 2, No. 1, 1984
  - https://dl.acm.org/doi/10.1145/2080.357392
Logical clocks
- Leslie Lamport. 1978. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978)

Additional resources:
- If you want to learn more about computer networks, you can read this book. It is a widely used textbook for computer networks courses and covers a variety of topics in depth.
  - Computer Networks: A Systems Approach, 5th Edition, by Larry L. Peterson and Bruce S. Davie, Morgan Kaufmann, 2011 (available online through Northeastern library)
- Friedemann Mattern, "Virtual Time and Global States of Distributed Systems", Workshop on Parallel and Distributed Algorithms, Elsevier, pp. 215–226, 1989
- Colin J. Fidge, "Timestamps in Message-Passing Systems That Preserve the Partial Ordering". 11th Australian Computer Science Conference. pp. 56–66. 1988.

Week 4. Virtualization, containers, cloud, and data centers

Cloud computing
- Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. 2010. A view of cloud computing. Commun. ACM 53, 4 (April 2010), 50–58.
  - https://cacm.acm.org/magazines/2010/4/81493-a-view-of-cloud-computing/fulltext
Virtualization
- Understanding Full Virtualization, Paravirtualization, and Hardware Assist
  - https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/VMware_paravirtualization.pdf
Cluster management
- Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. 2015. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems
  - https://research.google/pubs/pub43438/

Additional resources:
- Data centers
  - Luiz André Barroso; Urs Hölzle; Parthasarathy Ranganathan; Margaret Martonosi, The Datacenter as a Computer: Designing Warehouse-Scale Machines, Third Edition, Morgan & Claypool, 2018.
    - (Available online through Northeastern library)
- Virtualization
  - Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. 2003. Xen and the art of virtualization. In Proceedings of the nineteenth ACM symposium on Operating systems principles
    - https://www.cl.cam.ac.uk/research/srg/netos/papers/2003-xensosp.pdf

Week 5. Replication

Replication
- Chapter 16.0 – 16.5, Sukumar Ghosh, Distributed Systems: An Algorithmic Approach, 2nd Edition (2014).
  - (Available online through Northeastern library)
Weak consistency
- Doug Terry, Replicated Data Consistency Explained Through Baseball, MSR Technical Report, October 2011
  - https://www.microsoft.com/en-us/research/wp-content/uploads/2011/10/ConsistencyAndBaseballReport.pdf
Chain replication
- Robbert van Renesse and Fred B. Schneider. 2004. Chain replication for supporting high throughput and availability. In Proceedings of the 6th conference on Symposium on Operating Systems Design and Implementation.
  - https://www.cs.cornell.edu/home/rvr/papers/OSDI04.pdf

Additional resources
- Douglas B. Terry, Vijayan Prabhakaran, Ramakrishna Kotla, Mahesh Balakrishnan, Marcos K. Aguilera, and Hussam Abu-Libdeh. 2013. Consistency-based service level agreements for cloud storage. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
  - https://www.microsoft.com/en-us/research/publication/consistency-based-service-level-agreements-for-cloud-storage/
- Alan Demers, Dan Greene, Carl Hauser, Wes Irish, John Larson, Scott Shenker, Howard Sturgis, Dan Swinehart, and Doug Terry. Epidemic algorithms for replicated database maintenance. In 6th ACM Symposium on Principles of Distributed Computing (PODC), August 1987, pages 1–12.
  - https://dl.acm.org/doi/10.1145/41840.41841 (Available through Northeastern library: access ACM digital library)
- Jeff Terrace and Michael J. Freedman. 2009. Object storage on CRAQ: high-throughput chain replication for read-mostly workloads. In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, USA, 11.
  - https://www.usenix.org/legacy/events/usenix09/tech/full_papers/terrace/terrace.pdf

Weeks 6 and 7. Consensus and distributed transactions

Consensus in general
- Chapter 13.1 – 13.2, Sukumar Ghosh, Distributed Systems: An Algorithmic Approach, 2nd Edition (2014).
  - (Available online through Northeastern library)
Paxos
- Pages 438-449, M. van Steen and A.S. Tanenbaum, Distributed Systems, 3rd ed., distributed-systems.net, 2017.
  - (Online link: https://www.distributed-systems.net/index.php/books/ds3/ )
Raft
- https://raft.github.io/raft.pdf
- https://www.youtube.com/watch?v=LAqyTyNUYSY
2PC/3PC
- Chapter 10.3.0 - 10.3.2, Kenneth P. Birman, Guide to Reliable Distributed Systems, 1st edition, Springer
  - (Available online through Northeastern library)

Additional resources:
- Leslie Lamport, Paxos made simple, 2001
  - https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
- Robbert Van Renesse and Deniz Altinbuken. 2015. Paxos Made Moderately Complex. ACM Comput. Surv. 47, 3, Article 42
  - https://paxos.systems/paper/
- Mike Burrows. 2006. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th symposium on Operating systems design and implementation
  - https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
- Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (April 1985)
  - (Log in to ACM digital library through Northeastern library to access the article)

Week 8. NoSQL databases

CAP theorem
- E. Brewer, "CAP twelve years later: How the "rules" have changed" in Computer, vol. 45, no. 02, pp. 23-29, 2012.
NoSQL vs SQL
- https://www.ibm.com/cloud/blog/sql-vs-nosql
Google Bigtable
- https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
Amazon Dynamo
- Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon's highly available key-value store. SIGOPS Oper. Syst. Rev. 41, 6 (December 2007), 205–220.
  - (Log in to ACM digital library through Northeastern library to access the article)

Additional reading
- D. Pritchett. BASE: An ACID Alternative. ACM Queue, July 28, 2008.
- Google File System
  - https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

Week 9. In-memory systems

RAMCloud
- John Ousterhout, Arjun Gopalan, Ashish Gupta, Ankita Kejriwal, Collin Lee, Behnam Montazeri, Diego Ongaro, Seo Jin Park, Henry Qin, Mendel Rosenblum, Stephen Rumble, Ryan Stutsman, and Stephen Yang. 2015. The RAMCloud Storage System. ACM Trans. Comput. Syst. 33, 3, Article 7
  - https://dl.acm.org/doi/pdf/10.1145/2806887
DMA
- Chapter 12.2 I/O Hardware, Galvin, Peter, Silberschatz, Abraham, and Gagne, Greg. Operating System Concepts Essentials. 1st ed. Wiley, 2010.
  - (Available through Northeastern library)
RDMA
- https://www.rdmamojo.com/2014/03/31/remote-direct-memory-access-rdma/
FaRM: fast remote memory
- Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. 2014. FaRM: fast remote memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI’14)
  - https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-dragojevic.pdf

Additional reading:
- InfiniBand
  - https://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf

Week 10. Load balancing, caching, and content delivery networks

Load balancer
- https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/elb-ug.pdf
CDN
- https://www.imperva.com/learn/performance/cdn-guide/
Facebook's caching systems
- Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. 2013. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
  - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf
- Qi Huang, Ken Birman, Robbert van Renesse, Wyatt Lloyd, Sanjeev Kumar, and Harry C. Li. 2013. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
  - https://dl.acm.org/doi/pdf/10.1145/2517349.2522722

Week 11. Data analytics

MapReduce
- Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107–113.
  - https://research.google/pubs/pub62/
Stream processing
- Tyler Akidau, Slava Chernyak, Reuven Lax, Streaming Systems, O'Reilly Media, Inc.
  - Chapter 1 and Chapter 10
  - (Available through NEU library)
Spark
- Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: a unified engine for big data processing. Commun. ACM 59, 11 (November 2016), 56–65.
  - https://doi.org/10.1145/2934664

Additional resources
- Spark RDD
  - Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12).
  - https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
- Spark streaming
  - Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized streams: fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). Association for Computing Machinery, New York, NY, USA, 423–438.
  - https://cs.stanford.edu/~matei/papers/2013/sosp_spark_streaming.pdf

Week 13. Microservices, serverless computing, SDN, and blockchains

Microservices:
- Sam Newman, Building Microservices, 2nd Edition, O'Reilly Media
  - Chapter 1
Serverless computing:
- Jason Katzer, Learning Serverless, 2020, O'Reilly Media, Inc.
  - Chapter 1
Software defined networks:
- Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner. 2008. OpenFlow: enabling innovation in campus networks. SIGCOMM Comput. Commun. Rev. 38, 2 (April 2008), 69–74.
  - http://ccr.sigcomm.org/online/files/p69-v38n2n-mckeown.pdf
Blockchains:
- Satoshi Nakamoto, Bitcoin: A Peer-to-Peer Electronic Cash System
  - https://bitcoin.org/bitcoin.pdf

Additional reading:
- Microservices
  - Chris Richardson, Microservices Patterns, 2018, Manning Publications
    - Chapter 1
  - https://www.ibm.com/cloud/learn/microservices
- SwitchKV
  - Xiaozhou Li, Raghav Sethi, Michael Kaminsky, David G. Andersen, and Michael J. Freedman. 2016. Be fast, cheap and in control with SwitchKV. In Proceedings of the 13th Usenix Conference on Networked Systems Design and Implementation (NSDI'16). USENIX Association, USA, 31–44.
- NetCache
  - Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soulé, Jeongkeun Lee, Nate Foster, Changhoon Kim, and Ion Stoica. 2017. NetCache: Balancing Key-Value Stores with Fast In-Network Caching. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17). Association for Computing Machinery, New York, NY, USA, 121–136.