I offered Scalable Data Science (CS-563) at IIT Mandi in the August-December 2021 semester. The details are below.
Preamble: Recent technological advancements in areas such as the WWW, IoT, social networks, and e-commerce have generated large volumes of high-dimensional datasets. Learning insights from such datasets, in order to make important decisions, is ubiquitous in several real-life applications such as internet search, recommendation, network traffic monitoring, machine learning, and signal processing. Performing analytics on large datasets is not only challenging but at times impossible, despite access to powerful computing resources. The course addresses this challenge by covering scalable and practical algorithmic techniques that “compress big data into tiny data” with provable mathematical guarantees, such that the result of analytics on the compressed data closely approximates the corresponding result on the original datasets. The algorithmic techniques covered in the course sit at the intersection of theory and practice, have guarantees on their accuracy and efficiency, and can be easily implemented.
Syllabus of the course (link).
Moodle page of the course at IIT Mandi (link).
Lecture 1 - Introduction
Scribe
Handouts
Lecture 2-3 (Introduction to streaming algorithms, reservoir sampling, the set-membership problem and its solution using Bloom filters)
Additional References
Network Applications of Bloom Filters: A Survey. Andrei Broder and Michael Mitzenmacher.
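As a quick illustration of the set-membership idea (a toy sketch, not part of the course handouts), here is a minimal Bloom filter in Python. Deriving the k probe positions from a single SHA-256 digest is an implementation shortcut, not the independent hash functions assumed in the false-positive analysis; m and k are arbitrary demo choices.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k probe positions over an m-bit array."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions from one SHA-256 digest (an illustrative
        # shortcut for the k independent hash functions of the analysis).
        h = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(h[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        # No false negatives; false positives occur with small probability.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(m=1024, k=4)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)
```

Every inserted item is always reported present; the price is a small false-positive rate controlled by m and k.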
Lecture 4 (Universal Hashing)
Additional References
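To make the construction concrete, here is a small sketch of the Carter-Wegman universal family h_{a,b}(x) = ((ax + b) mod p) mod m. The prime p = 2^31 - 1 and table size m = 100 are arbitrary demo choices.

```python
import random

p = 2_147_483_647  # Mersenne prime 2^31 - 1, assumed to exceed the universe size
m = 100            # number of hash buckets

def make_universal_hash(m, p=p, rng=random):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod m with a in [1, p-1], b in [0, p-1]."""
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m

random.seed(0)
h = make_universal_hash(m)
buckets = [h(x) for x in range(10)]
```

For any fixed pair x != y, the probability (over the draw of a, b) that h(x) == h(y) is at most about 1/m, which is the property the streaming sketches in later lectures rely on.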
Lecture 5-6 (Frequency estimation via Count-Min Sketch)
Additional References
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode and S. Muthukrishnan. LATIN-2004.
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
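A toy Count-Min sketch, assuming one hash per row drawn from a universal family as in Lecture 4; the parameters w = 200, d = 5 are illustrative, not prescriptive.

```python
import random

class CountMinSketch:
    """Count-Min sketch: d rows of w counters, one universal hash per row.
    Estimates never undercount: f(x) <= estimate(x)."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.p = 2_147_483_647  # prime modulus for the hash family
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _pos(self, i, x):
        a, b = self.hashes[i]
        return ((a * x + b) % self.p) % self.w

    def update(self, x, count=1):
        for i in range(self.d):
            self.table[i][self._pos(i, x)] += count

    def estimate(self, x):
        # Min over rows: collisions only inflate counters, never deflate them.
        return min(self.table[i][self._pos(i, x)] for i in range(self.d))

cms = CountMinSketch(w=200, d=5)
for x in [1] * 50 + [2] * 30 + list(range(3, 103)):
    cms.update(x)
```

The one-sided error is the defining feature: the minimum over rows can only overestimate, by at most eps * ||f||_1 with probability 1 - delta for w = e/eps, d = ln(1/delta).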
Lecture 7 (AMS sketch for estimating \ell_2 norm of frequency vector)
Additional References
The space complexity of approximating the frequency moments. Noga Alon, Yossi Matias, and Mario Szegedy. STOC-1996.
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
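A small demonstration of the AMS estimator: each trial maintains Z = sum_x sigma(x) f(x) for random signs sigma(x) in {-1, +1}, so that E[Z^2] = ||f||_2^2; a median of means over trials controls the variance. Fully random signs are used here for simplicity (the classic analysis needs only 4-wise independence), and the trial counts are demo choices.

```python
import random
import statistics

def ams_l2_estimate(stream, trials=60, seed=0):
    """AMS estimate of ||f||_2^2 via median of means of Z^2 over trials."""
    rng = random.Random(seed)
    signs = [dict() for _ in range(trials)]
    Z = [0] * trials
    for x in stream:
        for t in range(trials):
            # Lazily assign a random +/-1 sign per distinct item per trial.
            s = signs[t].setdefault(x, rng.choice((-1, 1)))
            Z[t] += s
    sq = [z * z for z in Z]
    # Median of 6 means, each over 10 squared estimates.
    groups = [sum(sq[i:i + 10]) / 10 for i in range(0, trials, 10)]
    return statistics.median(groups)

stream = [1] * 4 + [2] * 3 + [3] * 2 + [4] * 1  # ||f||_2^2 = 16 + 9 + 4 + 1 = 30
est = ams_l2_estimate(stream)
```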
Lecture 8-9 (Frequency estimation via Count-Sketch)
Additional References
Finding Frequent Items in Data Streams. Moses Charikar, Kevin Chen, and Martin Farach-Colton. ICALP-2002.
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
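A toy Count sketch, assuming a bucket hash and a sign hash per row from the universal family of Lecture 4 (the parity-based sign hash and the sizes w = 256, d = 5 are demo choices). Unlike Count-Min, errors here are two-sided and scale with ||f||_2 rather than ||f||_1.

```python
import random
import statistics

class CountSketch:
    """Count sketch: each row adds sign(x) to x's bucket; the estimate is
    the median over rows of sign(x) * counter."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.p = 2_147_483_647
        self.w, self.d = w, d
        self.bucket_h = [(rng.randrange(1, self.p), rng.randrange(self.p))
                         for _ in range(d)]
        self.sign_h = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _bucket(self, i, x):
        a, b = self.bucket_h[i]
        return ((a * x + b) % self.p) % self.w

    def _sign(self, i, x):
        # Illustrative sign hash: parity of a universal hash value.
        a, b = self.sign_h[i]
        return 1 if ((a * x + b) % self.p) % 2 == 0 else -1

    def update(self, x, c=1):
        for i in range(self.d):
            self.table[i][self._bucket(i, x)] += self._sign(i, x) * c

    def estimate(self, x):
        return statistics.median(
            self._sign(i, x) * self.table[i][self._bucket(i, x)]
            for i in range(self.d))

cs = CountSketch(w=256, d=5)
for x in [7] * 100 + list(range(100, 150)):
    cs.update(x)
est7 = cs.estimate(7)
```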
Lecture 10 (Approximate Counting -- Morris Counter)
Additional References
Counting Large Numbers of Events in Small Registers. Robert Morris.
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
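A small simulation of the Morris counter: store only X, increment it with probability 2^{-X} per event, and report 2^X - 1, which is an unbiased estimate of the true count. The event count and number of averaged counters below are demo choices.

```python
import random

def morris_count(n_events, rng):
    """Morris approximate counter over a stream of n_events increments."""
    X = 0
    for _ in range(n_events):
        if rng.random() < 2.0 ** (-X):
            X += 1
    return 2 ** X - 1  # unbiased estimate of n_events

rng = random.Random(42)
# Average several independent counters to reduce the (large) variance
# of a single Morris counter.
runs = [morris_count(1000, rng) for _ in range(50)]
avg = sum(runs) / len(runs)
```

The point of the construction is space: X only needs O(log log n) bits, since X is about log2(n).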
Lecture 11-12 (Approximating distinct elements in a stream - Flajolet-Martin Sketch)
Additional References
Probabilistic counting algorithms for database applications. Philippe Flajolet and G. Nigel Martin. J. Comput. Syst. Sci., 31(2):182–209, 1985.
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
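A toy Flajolet-Martin sketch: hash each item, track the maximum number of trailing zeros R seen, and report 2^R as the distinct-element estimate; a median over independent hash functions stabilizes it. The trial count and hash family are demo choices, and no bias-correction constant is applied.

```python
import random
import statistics

def fm_distinct_estimate(stream, trials=30, seed=0):
    """Median over trials of the Flajolet-Martin estimate 2^R."""
    rng = random.Random(seed)
    p = 2_147_483_647
    ests = []
    for _ in range(trials):
        a, b = rng.randrange(1, p), rng.randrange(p)
        R = 0
        for x in stream:
            h = (a * x + b) % p
            # Number of trailing zero bits of h (h == 0 treated as 0).
            tz = (h & -h).bit_length() - 1 if h else 0
            R = max(R, tz)
        ests.append(2 ** R)
    return statistics.median(ests)

stream = list(range(1000)) * 3  # 1000 distinct elements, each repeated 3 times
est = fm_distinct_estimate(stream)
```

Repetitions do not affect R at all, which is exactly why the sketch counts distinct elements rather than stream length.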
Lecture 13 (Improved algorithm for approximating distinct elements)
Additional References
Counting Distinct Elements in a Data Stream. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, Luca Trevisan. RANDOM-2002.
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
Lecture 14-15 (Computing heavy-hitters + Introduction to dimensionality reduction - JL lemma)
Additional References
Finding repeated elements. Jayadev Misra, David Gries. Sci. Comput. Program., 2(2):143–152, 1982.
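The Misra-Gries summary from the heavy-hitters part of these lectures can be sketched in a few lines: keep at most k-1 counters, and when a new item arrives with the table full, decrement every counter. Any item with frequency above n/k is guaranteed to survive (counts are underestimates by at most n/k). The stream below is a made-up example.

```python
def misra_gries(stream, k):
    """Misra-Gries heavy-hitters summary with at most k-1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Table full: decrement all counters, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = [1] * 40 + [2] * 30 + list(range(100, 130))  # n = 100
summary = misra_gries(stream, k=4)  # any item with frequency > 25 must survive
```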
Lecture 16 (JL lemma proof)
Handouts
Additional References
An Elementary Proof of a Theorem of Johnson and Lindenstrauss. Sanjoy Dasgupta, Anupam Gupta.
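A small numerical illustration of the JL lemma: project points through a dense Gaussian map y = (1/sqrt(k)) G x with i.i.d. N(0,1) entries in G, and check that pairwise distances survive. The dimensions (d = 500 down to k = 200) and the synthetic Gaussian points are demo choices; the lemma says k = O(log n / eps^2) suffices for n points.

```python
import math
import random

def jl_project(points, k, seed=0):
    """Dense Gaussian JL map: y = (1/sqrt(k)) * G x, G with i.i.d. N(0,1) entries."""
    rng = random.Random(seed)
    d = len(points[0])
    G = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
    scale = 1.0 / math.sqrt(k)
    return [[scale * sum(G[i][j] * x[j] for j in range(d)) for i in range(k)]
            for x in points]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(1)
pts = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(5)]
proj = jl_project(pts, k=200)
# Ratio of projected to original distance for every pair; close to 1 w.h.p.
ratios = [dist(proj[i], proj[j]) / dist(pts[i], pts[j])
          for i in range(5) for j in range(i + 1, 5)]
```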
Lecture 17-18 (Improved variants of JL lemma)
Handouts (Lec 16-17-18)
Additional References
Database-friendly Random Projections. Dimitris Achlioptas. PODS-2001.
Very Sparse Random Projections. Li, Hastie, Church. KDD-2006.
Lecture 19 (Fast JL Transform)
Handouts
Additional References
Approximate Nearest Neighbors and the Fast Johnson-Lindenstrauss Transform. Ailon and Chazelle. STOC-2006.
Lecture 20-21 (Fast JL Transform proof completion + Signed Random Projection (SRP))
Handouts (Lec 19, 20, 21)
Additional References
Similarity Estimation Techniques from Rounding Algorithms. Moses S. Charikar. STOC-2002.
Circulant Binary Embedding. Felix X. Yu, Sanjiv Kumar, Yunchao Gong, Shih-Fu Chang. ICML-2014.
Lecture 22 (Minwise Independent Permutations - MinHash, and its improved variants)
Additional References
Min-Wise Independent Permutations. Broder, Charikar, Frieze, Mitzenmacher. STOC-1998.
b-Bit minwise hashing. Ping Li, Arnd Christian König, WWW-2010.
One Permutation Hashing. Ping Li, Art B. Owen, Cun-Hui Zhang, NIPS 2012.
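A minimal MinHash sketch for Jaccard similarity: for each hash function in a universal family (standing in for a min-wise independent permutation), record the minimum hash value over the set; the fraction of agreeing signature coordinates estimates J(A, B). The 200 hash functions and the example sets are demo choices.

```python
import random

P = 2_147_483_647  # prime modulus for the hash family

def minhash_signature(s, hash_params, p=P):
    """One MinHash coordinate per (a, b): min over the set of (a*x + b) mod p."""
    return [min((a * x + b) % p for x in s) for (a, b) in hash_params]

def estimated_jaccard(sig_a, sig_b):
    # P[two MinHash values agree] equals the Jaccard similarity, so the
    # fraction of agreeing coordinates is an unbiased estimate of J(A, B).
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

rng = random.Random(0)
params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(200)]
A = set(range(0, 100))
B = set(range(50, 150))  # true Jaccard = 50 / 150 = 1/3
est = estimated_jaccard(minhash_signature(A, params),
                        minhash_signature(B, params))
```

The b-bit and one-permutation variants referenced above reduce the storage and hashing cost of exactly this signature.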
Lecture 23-24 (Dimensionality reduction for Binary Data (BinSketch), Feature Hashing)
Additional References
Efficient Sketching Algorithm for Sparse Binary Data. Pratap, Bera, Revanuru. ICDM-2019.
Feature Hashing for Large Scale Multitask Learning. Weinberger, Dasgupta, Langford, Smola, Attenberg, ICML-2009.
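A toy version of the signed hashing trick in the style of Weinberger et al.: each feature name is hashed to one of m buckets together with a +/-1 sign, which makes inner products between hashed vectors unbiased estimates of the original ones. Using an MD5 digest for both index and sign is purely an illustrative, deterministic stand-in for the paper's hash functions; the dimension m = 16 and the example document are made up.

```python
import hashlib

def feature_hash(features, m):
    """Signed feature hashing of a {name: value} dict into m buckets."""
    vec = [0.0] * m
    for name, value in features.items():
        # Deterministic 64-bit hash of the feature name (demo choice).
        h = int.from_bytes(hashlib.md5(name.encode()).digest()[:8], "big")
        idx = h % m
        sign = 1 if (h >> 60) & 1 == 0 else -1
        vec[idx] += sign * value
    return vec

doc = {"cat": 2.0, "sat": 1.0, "mat": 1.0}
v = feature_hash(doc, m=16)
```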
Lecture 25 (Dimensionality reduction for the \ell_p norm via p-stable distributions, for p \in (0, 2])
Additional References
Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Piotr Indyk. FOCS-2000.
Lecture 26-27 (Locality sensitive hashing framework for approximate nearest neighbor search)
Additional References
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. Piotr Indyk, Rajeev Motwani, STOC 1998.
Locality Sensitive Hashing (LSH) Home Page. Alexandr Andoni.
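The banding trick from the LSH framework can be demonstrated end to end with MinHash signatures: split each signature into b bands of r rows, bucket items by band, and report items sharing any band as candidate near neighbours, giving the familiar S-curve 1 - (1 - s^r)^b. All parameters and the three example sets below are demo choices.

```python
import random
from collections import defaultdict

def lsh_candidates(signatures, bands, rows):
    """Candidate pairs from the banding scheme over integer signatures."""
    buckets = defaultdict(list)
    for item, sig in signatures.items():
        for b in range(bands):
            # Items agreeing on an entire band share a bucket.
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(item)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add(tuple(sorted((members[i], members[j]))))
    return pairs

rng = random.Random(0)
p = 2_147_483_647
params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(40)]

def minhash(s):
    return [min((a * x + b) % p for x in s) for a, b in params]

sets = {"A": set(range(100)), "B": set(range(5, 105)),  # J(A, B) ~ 0.9
        "C": set(range(5000, 5100))}                    # disjoint from A, B
cand = lsh_candidates({k: minhash(v) for k, v in sets.items()},
                      bands=10, rows=4)
```

With b = 10, r = 4 the similar pair (A, B) almost surely collides in some band, while the dissimilar pairs almost surely do not.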
Lecture 28 (Sign Random Projection (SRP)/SimHash + Hamming LSH)
Additional References
Similarity Estimation Techniques from Rounding Algorithms. Moses S. Charikar. STOC-2002.
Similarity Search in High Dimensions via Hashing. Gionis, Indyk, Motwani. VLDB-1999.
Lecture 29 (LSH for p-stable distributions for p \in (0, 2])
Additional References
Lecture 30-31 (LSH for Inner product for Real valued/Binary vectors + applications of LSH)
Additional References
Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment. Anshumali Shrivastava, Ping Li. WWW-2015.
Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). Anshumali Shrivastava, Ping Li. NIPS-2014.
On Symmetric and Asymmetric LSHs for Inner Product Search. Neyshabur and Srebro. ICML-2015.
Frequent-Itemset Mining using Locality-Sensitive Hashing. Debajyoti Bera and Rameshwar Pratap. COCOON-2015.
Scalable and Sustainable Deep Learning via Randomized Hashing. Spring and Shrivastava. KDD-2016.
Compressing Neural Networks with the Hashing Trick. Chen et al. ICML-2015.
Finding Interesting Associations without Support Pruning. Cohen et al. IEEE TKDE-2001.
Chapter 3 of Mining of Massive Datasets book. Jure Leskovec, Anand Rajaraman, Jeff Ullman.
Lecture 32 (Approximating Matrix Multiplication via Sampling)
Handouts
Additional References
Section 6.3 of Foundations of Data Science. Avrim Blum, John Hopcroft, and Ravindran Kannan. 2018.
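A basic column-row sampling estimator for AB, in the spirit of this lecture: sample s outer products, choosing column k of A (and row k of B) with probability proportional to ||A_k|| * ||B^k||, and average the rescaled outer products, which is unbiased for AB. The small matrices and s = 2000 are demo choices.

```python
import random

def sampled_matmul(A, B, s, rng):
    """Unbiased randomized approximation of A @ B via s sampled outer products."""
    n = len(A[0])  # inner dimension
    colnorm = [sum(A[i][k] ** 2 for i in range(len(A))) ** 0.5 for k in range(n)]
    rownorm = [sum(B[k][j] ** 2 for j in range(len(B[0]))) ** 0.5 for k in range(n)]
    w = [colnorm[k] * rownorm[k] for k in range(n)]
    total = sum(w)
    probs = [x / total for x in w]  # optimal sampling probabilities
    C = [[0.0] * len(B[0]) for _ in range(len(A))]
    for _ in range(s):
        k = rng.choices(range(n), weights=probs)[0]
        scale = 1.0 / (s * probs[k])  # rescale to keep the estimate unbiased
        for i in range(len(A)):
            for j in range(len(B[0])):
                C[i][j] += scale * A[i][k] * B[k][j]
    return C

rng = random.Random(0)
A = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C_est = sampled_matmul(A, B, s=2000, rng=rng)
C_true = [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(2)]
          for i in range(2)]
```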
Lecture 33-34 (Approximating Matrix Multiplication via Sketching + Introduction to clustering)
Additional References
Compressed Matrix Multiplication. Rasmus Pagh. ACM Transactions on Computation Theory, 2013.
Lecture 35-36-37 (Lloyd's heuristic for the k-means problem, k-means++ sampling and its analysis)
Additional References
k-means++: The Advantages of Careful Seeding. David Arthur, Sergei Vassilvitskii, SODA-2007.
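The k-means++ seeding step in a few lines: pick the first center uniformly at random, then sample each subsequent center with probability proportional to D(x)^2, the squared distance of x to its nearest chosen center. The three synthetic 2-D clusters below are made up for the demo.

```python
import random

def kmeans_pp_seed(points, k, rng):
    """k-means++ seeding: D^2-weighted sampling of k initial centers."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x)^2: squared distance to the nearest center chosen so far.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
              for x in points]
        total = sum(d2)
        r, acc = rng.random() * total, 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers

rng = random.Random(0)
# Three well-separated synthetic 2-D clusters.
pts = ([[rng.gauss(0, .1), rng.gauss(0, .1)] for _ in range(30)] +
       [[rng.gauss(5, .1), rng.gauss(5, .1)] for _ in range(30)] +
       [[rng.gauss(10, .1), rng.gauss(0, .1)] for _ in range(30)])
centers = kmeans_pp_seed(pts, 3, rng)
```

Running Lloyd's heuristic from these seeds is the full k-means++ algorithm, whose expected cost is within an O(log k) factor of optimal.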
Lecture 38-39 (k-means++ sampling proof completion, improved variants of k-means++ sampling, introduction to k-modes and spherical k-means clustering)
Additional References
Scalable k-means++. Bahmani et al. VLDB-2012.
Fast and Provably Good Seedings for k-Means. Bachem et al. NIPS-2016.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Zhexue Huang. Data Mining and Knowledge Discovery, 1998.
Concept Decompositions for Large Sparse Text Data Using Clustering. Inderjit S. Dhillon & Dharmendra S. Modha. Machine Learning journal -2001.
Lecture 40 (k-center clustering)
Handouts
Lecture 41 (Page rank)
Handouts
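A plain power-iteration PageRank on a tiny made-up graph: each node starts with rank 1/n, and every step redistributes a damped share d of each node's rank along its out-links, plus a uniform (1-d)/n teleport term. The graph, damping factor, and iteration count are demo choices.

```python
def pagerank(adj, d=0.85, iters=100):
    """Power-iteration PageRank on an adjacency dict {node: [out-neighbours]}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}  # teleport term
        for u in nodes:
            out = adj[u]
            if out:
                share = d * rank[u] / len(out)
                for v in out:
                    new[v] += share
            else:
                # Dangling node: spread its mass uniformly.
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pr = pagerank(graph)
```

Total rank is conserved at every step, and node "c" (linked to by everyone) ends up with the highest rank while "d" (linked to by no one) gets only the teleport mass.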
Similar courses taught at other universities: the reference material covered in the following courses is useful for some parts of this course.
Algorithms for Big Data, Fall 2020, David Woodruff (CMU).
Algorithms for Massive Data (TTIC 41000 - Special Topics) Sepideh Mahabadi (TTIC).
Data Stream Algorithms: Amit Chakrabarti (Dartmouth).
Algorithms for Big Data 2015 Jelani Nelson (Harvard).
Scalable Data Science. Anirban Dasgupta (IIT Gandhinagar) and Sourangshu Bhattacharya (IIT Kharagpur). NPTEL.
Algorithms for Big Data: Chandra Chekuri (UIUC).
Algorithms for Big Data: Grigory Yaroslavtsev (UPenn).
Algorithms for Modern Data Models: Ashish Goel (Stanford).
Models of Computation for Massive Data: Jeff Phillips (Utah).
The Modern Algorithmic Toolbox: Tim Roughgarden, Gregory Valiant (Stanford).
Sublinear Algorithms for Big Data: Qin Zhang (Indiana University Bloomington).
Pointers to reference books/lecture notes/other relevant materials:
Data Stream Algorithms. Amit Chakrabarti. Lecture notes. 2020.
Sketching Algorithms. Jelani Nelson. Lecture notes. 2020.
Small Summaries for Big Data. Graham Cormode and Ke Yi. Book link. 2020.
Randomized algorithms for matrices and data. Michael W. Mahoney. Book link, 2011.
Sketching as a Tool for Numerical Linear Algebra. David P. Woodruff. Book link. 2015.
Foundations of Data Science. Avrim Blum, John Hopcroft, and Ravindran Kannan. Book link. 2018.
Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Book link. 2014.
Open problems in data streams and related topics. IIT-Kanpur workshop on algorithms for data streams, 2006.