I offered Scalable Data Science (CS-563) at IIT Mandi in the August-December 2021 semester. The details are below.
Preamble: Recent technological advancements in areas such as the WWW, IoT, social networks, and e-commerce have generated large volumes of high-dimensional datasets. Learning insights from such datasets, in order to make important decisions, is ubiquitous in several real-life applications such as internet search, recommendation, network traffic monitoring, machine learning, and signal processing. Performing analytics on large datasets is not only challenging but at times impossible, despite access to powerful computing resources. The course addresses this challenge by covering scalable and practical algorithmic techniques that “compress big data into tiny data” with provable mathematical guarantees, such that the result of analytics on the compressed data closely approximates the corresponding result on the original datasets. The algorithmic techniques covered in the course sit at the intersection of theory and practice, have guarantees on their accuracy and efficiency, and can be easily implemented.
Syllabus of the course (link).
Moodle page of the course at IIT Mandi (link).
Lecture 1 - Introduction
Scribe
Handouts
Lecture 2-3 (Introduction to streaming algorithms, reservoir sampling, the set-membership problem and its solution using Bloom filters)
Additional References
Network Applications of Bloom Filters: A Survey. Andrei Broder and Michael Mitzenmacher.
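As a quick illustration of the set-membership idea (a toy sketch, not part of the course handouts), here is a minimal Bloom filter in Python. Deriving the k probe positions from a single SHA-256 digest is an implementation shortcut, not the independent hash functions assumed in the false-positive analysis; m and k are arbitrary demo choices.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k probe positions over an m-bit array."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions from one SHA-256 digest (an illustrative
        # shortcut for the k independent hash functions of the analysis).
        h = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(h[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        # No false negatives; false positives occur with small probability.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(m=1024, k=4)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)
```

Every inserted item is always reported present; the price is a small false-positive rate controlled by m and k.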
Lecture 4 (Universal Hashing)
Additional References
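To make the construction concrete, here is a small sketch of the Carter-Wegman universal family h_{a,b}(x) = ((ax + b) mod p) mod m. The prime p = 2^31 - 1 and table size m = 100 are arbitrary demo choices.

```python
import random

p = 2_147_483_647  # Mersenne prime 2^31 - 1, assumed to exceed the universe size
m = 100            # number of hash buckets

def make_universal_hash(m, p=p, rng=random):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod m with a in [1, p-1], b in [0, p-1]."""
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m

random.seed(0)
h = make_universal_hash(m)
buckets = [h(x) for x in range(10)]
```

For any fixed pair x != y, the probability (over the draw of a, b) that h(x) == h(y) is at most about 1/m, which is the property the streaming sketches in later lectures rely on.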
Lecture 5-6 (Frequency estimation via Count-Min Sketch)
Additional References
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode and S. Muthukrishnan. LATIN-2004.
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
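A toy Count-Min sketch, assuming one hash per row drawn from a universal family as in Lecture 4; the parameters w = 200, d = 5 are illustrative, not prescriptive.

```python
import random

class CountMinSketch:
    """Count-Min sketch: d rows of w counters, one universal hash per row.
    Estimates never undercount: f(x) <= estimate(x)."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.p = 2_147_483_647  # prime modulus for the hash family
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _pos(self, i, x):
        a, b = self.hashes[i]
        return ((a * x + b) % self.p) % self.w

    def update(self, x, count=1):
        for i in range(self.d):
            self.table[i][self._pos(i, x)] += count

    def estimate(self, x):
        # Min over rows: collisions only inflate counters, never deflate them.
        return min(self.table[i][self._pos(i, x)] for i in range(self.d))

cms = CountMinSketch(w=200, d=5)
for x in [1] * 50 + [2] * 30 + list(range(3, 103)):
    cms.update(x)
```

The one-sided error is the defining feature: the minimum over rows can only overestimate, by at most eps * ||f||_1 with probability 1 - delta for w = e/eps, d = ln(1/delta).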
Lecture 7 (AMS sketch for estimating \ell_2 norm of frequency vector)
Additional References
The space complexity of approximating the frequency moments. Noga Alon, Yossi Matias, and Mario Szegedy. STOC-1996.
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
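A small demonstration of the AMS estimator: each trial maintains Z = sum_x sigma(x) f(x) for random signs sigma(x) in {-1, +1}, so that E[Z^2] = ||f||_2^2; a median of means over trials controls the variance. Fully random signs are used here for simplicity (the classic analysis needs only 4-wise independence), and the trial counts are demo choices.

```python
import random
import statistics

def ams_l2_estimate(stream, trials=60, seed=0):
    """AMS estimate of ||f||_2^2 via median of means of Z^2 over trials."""
    rng = random.Random(seed)
    signs = [dict() for _ in range(trials)]
    Z = [0] * trials
    for x in stream:
        for t in range(trials):
            # Lazily assign a random +/-1 sign per distinct item per trial.
            s = signs[t].setdefault(x, rng.choice((-1, 1)))
            Z[t] += s
    sq = [z * z for z in Z]
    # Median of 6 means, each over 10 squared estimates.
    groups = [sum(sq[i:i + 10]) / 10 for i in range(0, trials, 10)]
    return statistics.median(groups)

stream = [1] * 4 + [2] * 3 + [3] * 2 + [4] * 1  # ||f||_2^2 = 16 + 9 + 4 + 1 = 30
est = ams_l2_estimate(stream)
```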
Lecture 8-9 (Frequency estimation via Count-Sketch)
Additional References
Finding Frequent Items in Data Streams. Moses Charikar, Kevin Chen, and Martin Farach-Colton. ICALP-2002.
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
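A toy Count sketch, assuming a bucket hash and a sign hash per row from the universal family of Lecture 4 (the parity-based sign hash and the sizes w = 256, d = 5 are demo choices). Unlike Count-Min, errors here are two-sided and scale with ||f||_2 rather than ||f||_1.

```python
import random
import statistics

class CountSketch:
    """Count sketch: each row adds sign(x) to x's bucket; the estimate is
    the median over rows of sign(x) * counter."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.p = 2_147_483_647
        self.w, self.d = w, d
        self.bucket_h = [(rng.randrange(1, self.p), rng.randrange(self.p))
                         for _ in range(d)]
        self.sign_h = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _bucket(self, i, x):
        a, b = self.bucket_h[i]
        return ((a * x + b) % self.p) % self.w

    def _sign(self, i, x):
        # Illustrative sign hash: parity of a universal hash value.
        a, b = self.sign_h[i]
        return 1 if ((a * x + b) % self.p) % 2 == 0 else -1

    def update(self, x, c=1):
        for i in range(self.d):
            self.table[i][self._bucket(i, x)] += self._sign(i, x) * c

    def estimate(self, x):
        return statistics.median(
            self._sign(i, x) * self.table[i][self._bucket(i, x)]
            for i in range(self.d))

cs = CountSketch(w=256, d=5)
for x in [7] * 100 + list(range(100, 150)):
    cs.update(x)
est7 = cs.estimate(7)
```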
Lecture 10 (Approximate Counting -- Morris Counter)
Additional References
Counting Large Numbers of Events in Small Registers. Robert Morris.
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
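A small simulation of the Morris counter: store only X, increment it with probability 2^{-X} per event, and report 2^X - 1, which is an unbiased estimate of the true count. The event count and number of averaged counters below are demo choices.

```python
import random

def morris_count(n_events, rng):
    """Morris approximate counter over a stream of n_events increments."""
    X = 0
    for _ in range(n_events):
        if rng.random() < 2.0 ** (-X):
            X += 1
    return 2 ** X - 1  # unbiased estimate of n_events

rng = random.Random(42)
# Average several independent counters to reduce the (large) variance
# of a single Morris counter.
runs = [morris_count(1000, rng) for _ in range(50)]
avg = sum(runs) / len(runs)
```

The point of the construction is space: X only needs O(log log n) bits, since X is about log2(n).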
Lecture 11-12 (Approximating distinct elements in a stream - Flajolet-Martin Sketch)
Additional References
Probabilistic counting algorithms for database applications. Philippe Flajolet and G. Nigel Martin. J. Comput. Syst. Sci., 31(2):182–209, 1985.
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
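A toy Flajolet-Martin sketch: hash each item, track the maximum number of trailing zeros R seen, and report 2^R as the distinct-element estimate; a median over independent hash functions stabilizes it. The trial count and hash family are demo choices, and no bias-correction constant is applied.

```python
import random
import statistics

def fm_distinct_estimate(stream, trials=30, seed=0):
    """Median over trials of the Flajolet-Martin estimate 2^R."""
    rng = random.Random(seed)
    p = 2_147_483_647
    ests = []
    for _ in range(trials):
        a, b = rng.randrange(1, p), rng.randrange(p)
        R = 0
        for x in stream:
            h = (a * x + b) % p
            # Number of trailing zero bits of h (h == 0 treated as 0).
            tz = (h & -h).bit_length() - 1 if h else 0
            R = max(R, tz)
        ests.append(2 ** R)
    return statistics.median(ests)

stream = list(range(1000)) * 3  # 1000 distinct elements, each repeated 3 times
est = fm_distinct_estimate(stream)
```

Repetitions do not affect R at all, which is exactly why the sketch counts distinct elements rather than stream length.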
Lecture 13 (Improved algorithm for approximating distinct elements)
Additional References
Counting Distinct Elements in a Data Stream. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, Luca Trevisan. RANDOM-2002.
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
Lecture 14-15 (Computing heavy-hitters + Introduction to dimensionality reduction - JL lemma)
Additional References
Finding repeated elements. Jayadev Misra, David Gries. Sci. Comput. Program., 2(2):143–152, 1982.
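The Misra-Gries summary from the heavy-hitters part of these lectures can be sketched in a few lines: keep at most k-1 counters, and when a new item arrives with the table full, decrement every counter. Any item with frequency above n/k is guaranteed to survive (counts are underestimates by at most n/k). The stream below is a made-up example.

```python
def misra_gries(stream, k):
    """Misra-Gries heavy-hitters summary with at most k-1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Table full: decrement all counters, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = [1] * 40 + [2] * 30 + list(range(100, 130))  # n = 100
summary = misra_gries(stream, k=4)  # any item with frequency > 25 must survive
```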
Lecture 16 (JL lemma proof)
Handouts
Additional References
An Elementary Proof of a Theorem of Johnson and Lindenstrauss. Sanjoy Dasgupta, Anupam Gupta.
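A small numerical illustration of the JL lemma: project points through a dense Gaussian map y = (1/sqrt(k)) G x with i.i.d. N(0,1) entries in G, and check that pairwise distances survive. The dimensions (d = 500 down to k = 200) and the synthetic Gaussian points are demo choices; the lemma says k = O(log n / eps^2) suffices for n points.

```python
import math
import random

def jl_project(points, k, seed=0):
    """Dense Gaussian JL map: y = (1/sqrt(k)) * G x, G with i.i.d. N(0,1) entries."""
    rng = random.Random(seed)
    d = len(points[0])
    G = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
    scale = 1.0 / math.sqrt(k)
    return [[scale * sum(G[i][j] * x[j] for j in range(d)) for i in range(k)]
            for x in points]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(1)
pts = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(5)]
proj = jl_project(pts, k=200)
# Ratio of projected to original distance for every pair; close to 1 w.h.p.
ratios = [dist(proj[i], proj[j]) / dist(pts[i], pts[j])
          for i in range(5) for j in range(i + 1, 5)]
```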
Lecture 17-18 (Improved variants of JL lemma)
Handouts (Lec 16-17-18)
Additional References
Database-friendly Random Projections. Dimitris Achlioptas. PODS-2001.
Very Sparse Random Projections. Li, Hastie, Church. KDD-2006.
Lecture 19 (Fast JL Transform)
Handouts
Additional References
Approximate Nearest Neighbors and the Fast Johnson-Lindenstrauss Transform. Ailon and Chazelle. STOC-2006.
Lecture 20-21 (Fast JL Transform proof completion + Signed Random Projection (SRP))
Handouts (Lec 19, 20, 21)
Additional References
Similarity Estimation Techniques from Rounding Algorithms. Moses S. Charikar. STOC-2002.
Circulant Binary Embedding. Felix X. Yu, Sanjiv Kumar, Yunchao Gong, Shih-Fu Chang. ICML-2014.
Lecture 22 (Minwise Independent Permutations - MinHash, and its improved variants)
Additional References
Min-Wise Independent Permutations. Broder, Charikar, Frieze, Mitzenmacher. STOC-1998.
b-Bit minwise hashing. Ping Li, Arnd Christian König, WWW-2010.
One Permutation Hashing. Ping Li, Art B. Owen, Cun-Hui Zhang, NIPS 2012.
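A minimal MinHash sketch for Jaccard similarity: for each hash function in a universal family (standing in for a min-wise independent permutation), record the minimum hash value over the set; the fraction of agreeing signature coordinates estimates J(A, B). The 200 hash functions and the example sets are demo choices.

```python
import random

P = 2_147_483_647  # prime modulus for the hash family

def minhash_signature(s, hash_params, p=P):
    """One MinHash coordinate per (a, b): min over the set of (a*x + b) mod p."""
    return [min((a * x + b) % p for x in s) for (a, b) in hash_params]

def estimated_jaccard(sig_a, sig_b):
    # P[two MinHash values agree] equals the Jaccard similarity, so the
    # fraction of agreeing coordinates is an unbiased estimate of J(A, B).
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

rng = random.Random(0)
params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(200)]
A = set(range(0, 100))
B = set(range(50, 150))  # true Jaccard = 50 / 150 = 1/3
est = estimated_jaccard(minhash_signature(A, params),
                        minhash_signature(B, params))
```

The b-bit and one-permutation variants referenced above reduce the storage and hashing cost of exactly this signature.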
Lecture 23-24 (Dimensionality reduction for Binary Data (BinSketch), Feature Hashing)
Additional References
Efficient Sketching Algorithm for Sparse Binary Data. Pratap, Bera, Revanuru. ICDM-2019.
Feature Hashing for Large Scale Multitask Learning. Weinberger, Dasgupta, Langford, Smola, Attenberg, ICML-2009.
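A toy version of the signed hashing trick in the style of Weinberger et al.: each feature name is hashed to one of m buckets together with a +/-1 sign, which makes inner products between hashed vectors unbiased estimates of the original ones. Using an MD5 digest for both index and sign is purely an illustrative, deterministic stand-in for the paper's hash functions; the dimension m = 16 and the example document are made up.

```python
import hashlib

def feature_hash(features, m):
    """Signed feature hashing of a {name: value} dict into m buckets."""
    vec = [0.0] * m
    for name, value in features.items():
        # Deterministic 64-bit hash of the feature name (demo choice).
        h = int.from_bytes(hashlib.md5(name.encode()).digest()[:8], "big")
        idx = h % m
        sign = 1 if (h >> 60) & 1 == 0 else -1
        vec[idx] += sign * value
    return vec

doc = {"cat": 2.0, "sat": 1.0, "mat": 1.0}
v = feature_hash(doc, m=16)
```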
Lecture 25 (Dimensionality reduction for the \ell_p norm via p-stable distributions, for p \in (0, 2])
Additional References
Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Piotr Indyk. FOCS-2000.
Lecture 26-27 (Locality sensitive hashing framework for approximate nearest neighbor search)
Additional References
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. Piotr Indyk, Rajeev Motwani, STOC 1998.
Locality Sensitive Hashing (LSH) Home Page. Alexandr Andoni.
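The banding trick from the LSH framework can be demonstrated end to end with MinHash signatures: split each signature into b bands of r rows, bucket items by band, and report items sharing any band as candidate near neighbours, giving the familiar S-curve 1 - (1 - s^r)^b. All parameters and the three example sets below are demo choices.

```python
import random
from collections import defaultdict

def lsh_candidates(signatures, bands, rows):
    """Candidate pairs from the banding scheme over integer signatures."""
    buckets = defaultdict(list)
    for item, sig in signatures.items():
        for b in range(bands):
            # Items agreeing on an entire band share a bucket.
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(item)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add(tuple(sorted((members[i], members[j]))))
    return pairs

rng = random.Random(0)
p = 2_147_483_647
params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(40)]

def minhash(s):
    return [min((a * x + b) % p for x in s) for a, b in params]

sets = {"A": set(range(100)), "B": set(range(5, 105)),  # J(A, B) ~ 0.9
        "C": set(range(5000, 5100))}                    # disjoint from A, B
cand = lsh_candidates({k: minhash(v) for k, v in sets.items()},
                      bands=10, rows=4)
```

With b = 10, r = 4 the similar pair (A, B) almost surely collides in some band, while the dissimilar pairs almost surely do not.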
Lecture 28 (Sign Random Projection (SRP)/SimHash + Hamming LSH)
Additional References
Similarity Estimation Techniques from Rounding Algorithms. Moses S. Charikar. STOC-2002.
Similarity Search in High Dimensions via Hashing. Gionis, Indyk, Motwani. VLDB-1999.
Lecture 29 (LSH for p-stable distributions for p \in (0, 2])
Additional References
Lecture 30-31 (LSH for Inner product for Real valued/Binary vectors + applications of LSH)
Additional References
Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment. Anshumali Shrivastava, Ping Li. WWW-2015.
Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). Anshumali Shrivastava, Ping Li. NIPS-2014.
On Symmetric and Asymmetric LSHs for Inner Product Search. Neyshabur and Srebro. ICML-2015.
Frequent-Itemset Mining using Locality-Sensitive Hashing. Debajyoti Bera and Rameshwar Pratap. COCOON-2015.
Scalable and Sustainable Deep Learning via Randomized Hashing. Spring and Shrivastava. KDD-2016.
Compressing Neural Networks with the Hashing Trick. Chen et al. ICML-2015.
Finding Interesting Associations without Support Pruning. Cohen et al. IEEE TKDE-2001.
Chapter 3 of Mining of Massive Datasets book. Jure Leskovec, Anand Rajaraman, Jeff Ullman.
Lecture 32 (Approximating Matrix Multiplication via Sampling)
Handouts
Additional References
Section 6.3 of Foundations of Data Science. Avrim Blum, John Hopcroft, and Ravindran Kannan. 2018.
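A basic column-row sampling estimator for AB, in the spirit of this lecture: sample s outer products, choosing column k of A (and row k of B) with probability proportional to ||A_k|| * ||B^k||, and average the rescaled outer products, which is unbiased for AB. The small matrices and s = 2000 are demo choices.

```python
import random

def sampled_matmul(A, B, s, rng):
    """Unbiased randomized approximation of A @ B via s sampled outer products."""
    n = len(A[0])  # inner dimension
    colnorm = [sum(A[i][k] ** 2 for i in range(len(A))) ** 0.5 for k in range(n)]
    rownorm = [sum(B[k][j] ** 2 for j in range(len(B[0]))) ** 0.5 for k in range(n)]
    w = [colnorm[k] * rownorm[k] for k in range(n)]
    total = sum(w)
    probs = [x / total for x in w]  # optimal sampling probabilities
    C = [[0.0] * len(B[0]) for _ in range(len(A))]
    for _ in range(s):
        k = rng.choices(range(n), weights=probs)[0]
        scale = 1.0 / (s * probs[k])  # rescale to keep the estimate unbiased
        for i in range(len(A)):
            for j in range(len(B[0])):
                C[i][j] += scale * A[i][k] * B[k][j]
    return C

rng = random.Random(0)
A = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C_est = sampled_matmul(A, B, s=2000, rng=rng)
C_true = [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(2)]
          for i in range(2)]
```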
Lecture 33-34 (Approximating Matrix Multiplication via Sketching + Introduction to clustering)
Additional References
Compressed Matrix Multiplication. Rasmus Pagh. ACM Transactions on Computation Theory, 2013.
Lecture 35-36-37 (Lloyd's heuristic for the k-means problem, k-means++ sampling and its analysis)
Additional References
k-means++: The Advantages of Careful Seeding. David Arthur, Sergei Vassilvitskii, SODA-2007.
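The k-means++ seeding step in a few lines: pick the first center uniformly at random, then sample each subsequent center with probability proportional to D(x)^2, the squared distance of x to its nearest chosen center. The three synthetic 2-D clusters below are made up for the demo.

```python
import random

def kmeans_pp_seed(points, k, rng):
    """k-means++ seeding: D^2-weighted sampling of k initial centers."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x)^2: squared distance to the nearest center chosen so far.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
              for x in points]
        total = sum(d2)
        r, acc = rng.random() * total, 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers

rng = random.Random(0)
# Three well-separated synthetic 2-D clusters.
pts = ([[rng.gauss(0, .1), rng.gauss(0, .1)] for _ in range(30)] +
       [[rng.gauss(5, .1), rng.gauss(5, .1)] for _ in range(30)] +
       [[rng.gauss(10, .1), rng.gauss(0, .1)] for _ in range(30)])
centers = kmeans_pp_seed(pts, 3, rng)
```

Running Lloyd's heuristic from these seeds is the full k-means++ algorithm, whose expected cost is within an O(log k) factor of optimal.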
Lecture 38-39 (k-means++ sampling proof completion, improved variants of k-means++ sampling, introduction to k-modes and spherical k-means clustering)
Additional References
Scalable k-means++. Bahmani et al. VLDB-2012.
Fast and Provably Good Seedings for k-Means. Bachem et al. NIPS-2016.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Zhexue Huang. Data Mining and Knowledge Discovery, 1998.
Concept Decompositions for Large Sparse Text Data Using Clustering. Inderjit S. Dhillon & Dharmendra S. Modha. Machine Learning journal -2001.
Lecture 40 (k-center clustering)
Handouts
Lecture 41 (Page rank)
Handouts
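A plain power-iteration PageRank on a tiny made-up graph: each node starts with rank 1/n, and every step redistributes a damped share d of each node's rank along its out-links, plus a uniform (1-d)/n teleport term. The graph, damping factor, and iteration count are demo choices.

```python
def pagerank(adj, d=0.85, iters=100):
    """Power-iteration PageRank on an adjacency dict {node: [out-neighbours]}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}  # teleport term
        for u in nodes:
            out = adj[u]
            if out:
                share = d * rank[u] / len(out)
                for v in out:
                    new[v] += share
            else:
                # Dangling node: spread its mass uniformly.
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pr = pagerank(graph)
```

Total rank is conserved at every step, and node "c" (linked to by everyone) ends up with the highest rank while "d" (linked to by no one) gets only the teleport mass.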
Similar courses taught at other universities: the reference material covered in the following courses is useful for some parts of this course.
Algorithms for Big Data, Fall 2020, David Woodruff (CMU).
Algorithms for Massive Data (TTIC 41000 - Special Topics) Sepideh Mahabadi (TTIC).
Data Stream Algorithms: Amit Chakrabarti (Dartmouth).
Algorithms for Big Data 2015 Jelani Nelson (Harvard).
Scalable Data Science. Anirban Dasgupta (IIT Gandhinagar) and Sourangshu Bhattacharya (IIT Kharagpur). NPTEL.
Algorithms for Big Data: Chandra Chekuri (UIUC).
Algorithms for Big Data: Grigory Yaroslavtsev (UPenn).
Algorithms for Modern Data Models: Ashish Goel (Stanford).
Models of Computation for Massive Data: Jeff Phillips (Utah).
The Modern Algorithmic Toolbox: Tim Roughgarden, Gregory Valiant (Stanford).
Sublinear Algorithms for Big Data: Qin Zhang (Indiana University Bloomington).
Pointers to reference books/lecture notes/other relevant materials:
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
Sketching Algorithms. Jelani Nelson. Lecture notes. 2020.
Small Summaries for Big Data. Graham Cormode and Ke Yi. Book link. 2020.
Randomized algorithms for matrices and data. Michael W. Mahoney. Book link, 2011.
Sketching as a Tool for Numerical Linear Algebra. David P. Woodruff. Book link. 2015.
Foundations of Data Science. Avrim Blum, John Hopcroft, and Ravindran Kannan. Book link. 2018.
Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Book link. 2014.
Open problems in data streams and related topics. IIT-Kanpur workshop on algorithms for data streams, 2006.