Information
Course: Introduction to Big Data Algorithms
Instructor: Emilio Cruciani
When: Wednesdays 14:00-17:00, from 08/03 to 28/06
Where: Rechnerraum
Office Hours: By Appointment (emilio dot cruciani at plus dot ac dot at), Room 1.06 or Online
Course: Introduction to Big Data Algorithms
Instructor: Emilio Cruciani
When: Wednesdays 14:00-17:00, from 08/03 to 28/06
Where: Rechnerraum
Office Hours: By Appointment (emilio dot cruciani at plus dot ac dot at), Room 1.06 or Online
Modern datasets are massive. With more than 500 million tweets and over 8.5 billion Google searches per day, how can one make sense of all the data that is generated? This course will present an introduction to advanced algorithmic tools for the analysis of such large datasets.
Topics Overview: hashing techniques for finding similar items; algorithms for dimensionality reduction and data clustering; sampling and sketching for data streams; link analysis and graph mining.
There will be a set of 5 assignments. The exam is oral, and will be mostly based on the questions of the assignment. The assignments are not mandatory, but handing them in will result in bonus points at the exam.
Assignment 1. Due date: 31/03/23, 23:59
Assignment 2. Due date: 28/04/23, 23:59
Assignment 3. Due date: 23/05/23, 23:59
Assignment 4. Due date: 13/06/23, 23:59
Assignment 5. Due date: 05/07/23, 23:59
N.B. The handwritten notes may contain some errors and are uploaded as a guideline only.
08/03/23: Course Introduction: Big Data, Pitfalls; Review of Probability: Events and Probability, Discrete Random Variables, Expectation
Books: Chapters 1.1, 1.2 [MMDS], Chapters 1, 2, 3, 4 [PC]
Slides: Course Introduction, Review of Probability
Readings: Interview with Michael Jordan
15/03/23: Review of Probability: Concentration Bounds, Exercises; Finding Similar Items: Jaccard Similarity, Shingling, Min-hashing
Books: Chapters 3.1, 3.2, 3.3 [MMDS], Chapters 3, 4 [PC]
Slides: Finding Similar Items (1)
Notes: Probability Exercises
Papers: [1, 2, 3]
22/03/23: Finding Similar Items: Locality Sensitive Hashing: Jaccard Similarity/Distance; Metrics; LSH Theory: Locality Sensitive Hash Families, AND/OR Constructions
Books: Chapters 3.4, 3.5, 3.6 [MMDS]
Slides: Finding Similar Items (2)
Notes: LSH Theory, Review of Metrics
Papers: [4, 5]
29/03/23: LSH for Other Metrics: Hamming, Cosine, Euclidean Distances; Review of Linear Algebra: Matrices, Vectors, Eigenvalues & Eigenvectors, Power Iteration Method; Curse of Dimensionality; Dimensionality Reduction: Random Projections (Johnson-Lindenstrauss Lemma), Principal Component Analysis (PCA)
Books: Chapters 3.7, 7.1.3, 11.1, 11.2 [MMDS], Chapter 1, 2, 3, 4, 5, 6 [LA], Chapter 7.3.1 [MC]
Notes: LSH Theory, Review of Linear Algebra, Dimensionality Reduction
Papers: [4, 5, 6, 7, 8]
09/04/23: No Lecture (Easter Break)
12/04/23: No Lecture (Easter Break)
19/04/23: Dimensionality Reduction: Singular Value Decomposition (SVD), Best Low-Rank Approximation, CUR Matrix Approximation
Books: Chapter 11.3, 11.4 [MMDS], Chapter 7 [LA], Chapter 2.5 [MC]
Slides: Dimensionality Reduction
Papers: [9, 10, 11]
26/04/23: Clustering: Hierarchical Clustering, K-Center, K-Means, K-Means++, BFR, CURE
Books: Chapter 7.1, 7.2, 7.3, 7.4 [MMDS]
Slides: Clustering
Papers: [12, 13, 14]
03/05/23: Mining Frequent Itemsets. Market-Basket Model, Association Rules, Algorithms: A-Priori, Eclat, Park-Chen-Yu, Multistage/Multihash, Random Sampling, Savarese-Omiecinski-Navathe, Toivonen
Books: Chapter 6.1, 6.2, 6.3, 6.4 [MMDS]
Slides: Mining Frequent Itemsets
Papers: [15, 16, 17, 18, 19, 20, 21]
10/05/23 (14.00-15.00 in T03; 15.00-17.00 in Rechnerraum): Office Hours During Regular Class Time (No Lecture)
17/05/23: Mining Data Streams. Sliding Window Model: Datar-Gionis-Indyk-Motwani; Sampling: Fixed Proportion, Fixed Size (Reservoir); Filtering: Bloom Filter
Books: Chapter 4.1, 4.6, 4.2, 4.3 [MMDS]
Slides: Data Streams (1)
Notes: DGIM Error Bound
Papers: [22, 23, 24, 25]
24/05/23: Mining Data Streams. Estimating Moments: Morris Counter, Flajolet-Martin, Alon-Matias-Szegedy, Median of Means; Frequent Items (Heavy Hitters): Misra-Gries
Books: Chapter 4.4, 4.5 [MMDS]
Slides: Data Streams (2)
Papers: [26, 27, 28, 29]
31/05/23: No Lecture (Instructor Away)
07/06/23: Mining Data Streams: Frequent Items (Heavy Hitters): Count-Min Sketch. Link Analysis: The Web, PageRank, Personalized PageRank
Books: Chapter 5.1, 5.3 [MMDS]
Slides: Link Analysis
Notes: Heavy Hitters
Papers: [30, 31, 32, 33]
14/06/23: Link Analysis: HITS (Hubs and Authorities). Graph Mining (Clustering and Partitioning): Betweenness, Modularity, Girvan-Newman Divisive, Newman Agglomerative, Spectral Modularity Maximization
Books: Chapter 5.5, 10.1, 10.2 [MMDS]
Slides: Link Analysis, Graph Mining (1)
Papers: [34, 35, 36, 37]
21/06/23: Graph Mining (Clustering and Partitioning): Spectral Clustering, Fiedler Vector, K-Way Spectral Partitioning, Louvain Algorithm, Label Propagation Algorithms
Books: Chapter 10.4 [MMDS]
Slides: Graph Mining (1), Graph Mining (2)
Papers: [38, 39, 40, 41, 42]
29/06/23: Office Hours (No Lecture)
Books
[MMDS] Mining of Massive Datasets by J. Leskovec, A. Rajaraman, and J. Ullman (Cambridge University Press, 3rd edition, 2019). [website]
[PC] Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis by M. Mitzenmacher, and E. Upfal (Cambridge University Press, 2nd edition, 2017).
[LA] Introduction to Linear Algebra by G. Strang (Cambridge University Press, 5th edition, 2016).
[MC] Matrix Computations by G.H. Golub and C.F. Van Loan (Johns Hopkins University Press, 3rd edition, 1996).
Papers
A.Z. Broder. “On the resemblance and containment of documents”. Compression and Complexity of Sequences, 1997.
A.Z. Broder, M. Charikar, A.M. Frieze, and M. Mitzenmacher. “Min-wise independent permutations”. Symposium on Theory of Computing, 1998.
P. Li, A.B. Owen, and C.H. Zhang. “One permutation hashing”. Conference on Neural Information Processing Systems, 2012.
P. Indyk and R. Motwani. “Approximate nearest neighbor: towards removing the curse of dimensionality”. Symposium on Theory of Computing, 1998.
A. Gionis, P. Indyk, and R. Motwani. “Similarity search in high dimensions via hashing”. International Conference on Very Large Databases, 1999.
M.S. Charikar. “Similarity estimation techniques from rounding algorithms”. Symposium on Theory of Computing, 2002.
M. Datar, N. Immorlica, P. Indyk, and V.S. Mirrokni. “Locality-sensitive hashing scheme based on p-stable distributions”. Symposium on Computational Geometry, 2004.
D. Achlioptas. "Database-friendly random projections". Symposium on Principles of Database Systems, 2001.
S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. “Indexing by latent semantic analysis”. Journal of the American Society for Information Science, 1990.
M.W. Mahoney and P. Drineas. “CUR matrix decompositions for improved data analysis”. Proceedings of the National Academy of Sciences, 2009.
J. Sun, Y. Xie, H. Zhang, and C. Faloutsos. “Less is more: compact matrix decomposition for large sparse graphs”. International Conference on Data Mining, 2007.
D. Arthur and S. Vassilvitskii. “k-means++: the advantages of careful seeding”. Symposium on Discrete Algorithms, 2007.
P.S. Bradley, U.M. Fayyad, and C. Reina. “Scaling clustering algorithms to large databases”. Knowledge Discovery and Data Mining, 1998.
S. Guha, R. Rastogi, and K. Shim. “CURE: an efficient clustering algorithm for large databases”. International Conference on Management of Data, 1998.
R. Agrawal, T. Imielinski, and A. Swami. “Mining associations between sets of items in massive databases”. International Conference on Management of Data, 1993.
R. Agrawal and R. Srikant. “Fast algorithms for mining association rules”. International Conference on Very Large Databases, 1994.
M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. “New algorithms for fast discovery of association rules”. International Conference on Knowledge Discovery & Data Mining, 1997.
J.S. Park, M.S. Chen, and P.S. Yu. “An effective hash-based algorithm for mining association rules”. International Conference on Management of Data, 1995.
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J.D. Ullman. “Computing iceberg queries efficiently”. International Conference on Very Large Databases, 1998.
A. Savasere, E. Omiecinski, and S.B. Navathe. “An efficient algorithm for mining association rules in large databases”. International Conference on Very Large Databases, 1995.
H. Toivonen. “Sampling large databases for association rules”. International Conference on Very Large Databases, 1996.
M. Datar, A. Gionis, P. Indyk, and R. Motwani. “Maintaining stream statistics over sliding windows”. SIAM Journal on Computing, 2002.
P.B. Gibbons. “Distinct sampling for highly-accurate answers to distinct values queries and event reports”. International Conference on Very Large Databases, 2001.
J. Vitter. “Random sampling with a reservoir”. ACM Transactions on Mathematical Software, 1985.
B.H. Bloom. “Space/time trade-offs in hash coding with allowable errors”. Communication of the ACM, 1970.
R. Morris. “Counting large numbers of events in small registers”. Communications of the ACM, 1978.
P. Flajolet and G.N. Martin. “Probabilistic counting for database applications”. Symposium on Foundations of Computer Science, 1983.
N. Alon, Y. Matias, and M. Szegedy. “The space complexity of approximating frequency moments”. Symposium on Theory of Computing, 1996.
J. Misra and D. Gries. “Finding repeated elements”. Science of Computer Programming, 1982.
G. Cormode and S. Muthukrishnan. “An improved data stream summary: the count-min sketch and its applications”. Journal of Algorithms, 2005.
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Weiner. “Graph structure in the web”. Computer Networks, 2000.
S. Brin and L. Page. “Anatomy of a large-scale hypertextual web search engine”. International World-Wide-Web Conference, 1998.
T.H. Haveliwala. “Topic-sensitive PageRank”. International World-Wide-Web Conference, 2002.
J.M. Kleinberg. “Authoritative sources in a hyperlinked environment”. Journal of the ACM, 1999.
M. Girvan and M.E.J. Newman. “Community structure in social and biological networks”. PNAS, 2002.
M.E.J. Newman. “Fast algorithm for detecting community structure in networks”. Physical review E, 2004.
M.E.J. Newman. “Modularity and community structure in networks”. PNAS, 2006.
M. Fiedler. “Algebraic connectivity of graphs”. Czechoslovak Mathematical Journal, 1973.
M. Fiedler. “Laplacian of graphs and algebraic connectivity”. Banach Center Publications, 1989.
J. Shi and J. Malik. “Normalized cuts and image segmentation”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
V.D. Blondel, J.L. Guillaume, R. Lambiotte, E. Lefebvre. “Fast unfolding of communities in large networks”. Journal of Statistical Mechanics: Theory and Experiment, 2008.
U.N. Raghavan, R. Albert, S. Kumara. “Near linear time algorithm to detect community structures in large-scale networks”. Physical Review E, 2007.