CMPT 843 Big Data Analytics

Instructor: Jian Pei, Ph.D. TA: Pei Wang

Classes: M/W/F 14:30-15:20 AQ5007

Office hours: (Instructor) Monday 10-11 am, TASC 9429; (TA) Monday 15:30-16:30, TASC 1 9217

Emails: (course mail list) cmpt-843@sfu.ca (Instructor) jpei (TA) peiw

About this course

The purpose of this graduate course is twofold: broadening graduate students' knowledge and understanding of the current frontiers of data analytics and management research, and teaching data analytics and data mining research methodology and skills.

This semester, we will focus on big data analytics methods. Specifically, we will cover some fundamental and useful ideas and methods for handling big data, with a focus on sampling and computational statistics, as well as their programming implementations and applications. Since this is an advanced graduate course, sufficient preparation and interest in data analytics (i.e., database systems and data mining) and entry-level undergraduate statistics are assumed.

Prerequisites

A solid, comprehensive background in probability theory and statistics.

Reading List

Sara Ahmadian, Alessandro Epasto, Ravi Kumar, Mohammad Mahdian, "Clustering without Over-Representation", in KDD 2019.

Yanhao Wang, Yuchen Li, Kian-Lee Tan, "Coresets for Minimum Enclosing Balls over Sliding Windows", in KDD 2019.

Pei-Zhen Li, Ling Huang, Chang-Dong Wang, Jian-Huang Lai, "EdMot: An Edge Enhancement Approach for Motif-aware Community Detection", in KDD 2019.

Lijun Chang, "Efficient Maximum Clique Computation over Large Sparse Graphs", in KDD 2019.

Kirill Paramonov, Dmitry Shemetov, James Sharpnack, "Estimating Graphlet Statistics via Lifting", in KDD 2019.

Chi Wang, Bailu Ding, "Fast Approximation of Empirical Entropy via Subsampling", in KDD 2019.

Alexander Marx, Jilles Vreeken, "Identifiability of Cause and Effect using Regularized Regression", in KDD 2019.

Tomoki Yoshida, Ichiro Takeuchi, Masayuki Karasuyama, "Learning Interpretable Metric between Graphs: Convex Formulation and Computation with Graph Mining", in KDD 2019.

Kun Dong, Austin R. Benson, David Bindel, "Network Density of States", in KDD 2019.

Ari Kobren, Barna Saha, Andrew McCallum, "Paper Matching with Local Fairness Constraints", in KDD 2019.

Yi Li, Wei Xu, "PrivPy: General and Scalable Privacy-Preserving Data Mining", in KDD 2019.

Parikshit Ram, Kaushik Sinha, "Revisiting kd-tree for Nearest Neighbor Search", in KDD 2019.

[Xi Wang] Leonardo Pellegrina, Matteo Riondato, Fabio Vandin, "SPuManTE: Significant Pattern Mining with Unconditional Testing", in KDD 2019.

Yongjoo Park, Jingyi Qing, Xiaoyang Shen, and Barzan Mozafari. 2019. BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). ACM, New York, NY, USA, 1135-1152.

Daniel Ting. 2019. Approximate Distinct Counts for Billions of Datasets. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). ACM, New York, NY, USA, 69-86.

Abolfazl Asudeh, H. V. Jagadish, Gerome Miklau, Julia Stoyanovich, "On obtaining stable rankings", in VLDB 2019.

Yizhou Yan, Lei Cao, Samuel Madden, and Elke A. Rundensteiner. 2018. SWIFT: mining representative patterns from large event streams. Proc. VLDB Endow. 12, 3 (November 2018), 265-277.

Till Speicher, Hoda Heidari, Nina Grgic-Hlaca, Krishna P. Gummadi, Adish Singla, Adrian Weller, and Muhammad Bilal Zafar. 2018. A Unified Approach to Quantifying Algorithmic Unfairness: Measuring Individual & Group Unfairness via Inequality Indices. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 2239-2248.

Charles Sutton, Timothy Hobson, James Geddes, and Rich Caruana. 2018. Data Diff: Interpretable, Executable Summaries of Changes in Distributions for Data Wrangling. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 2279-2288.

Chengxi Zang, Peng Cui, and Wenwu Zhu. 2018. Learning and Interpreting Complex Distributions in Empirical Data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 2682-2691.

Austin R. Benson, Ravi Kumar, and Andrew Tomkins. 2018. Sequences of Sets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 1148-1157.

Biwei Huang, Kun Zhang, Yizhu Lin, Bernhard Schölkopf, and Clark Glymour. 2018. Generalized Score Functions for Causal Discovery. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 1551-1560.

Abdulhakim A. Qahtan, Ahmed Elmagarmid, Raul Castro Fernandez, Mourad Ouzzani, and Nan Tang. 2018. FAHES: A Robust Disguised Missing Values Detector. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 2100-2109.

Alban Siffer, Pierre-Alain Fouque, Alexandre Termier, and Christine Largouët. 2018. Are your data gathered? In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 2210-2218.

Ashudeep Singh and Thorsten Joachims. 2018. Fairness of Exposure in Rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 2219-2228.

David Cohen-Steiner, Weihao Kong, Christian Sohler, and Gregory Valiant. 2018. Approximating the Spectrum of a Graph. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 1263-1271.

Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, Tamer Özsu: The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. VLDB 2018, 420-431.

Hazar Harmouch, Felix Naumann: Cardinality Estimation: An Experimental Survey. VLDB 2018, 499-512.

Sebastian Kruse, Felix Naumann: Efficient Discovery of Approximate Dependencies. VLDB 2018, 759-772.

Stephen Macke, Yiming Zhang, Silu Huang, Aditya Parameswaran: Adaptive Sampling for Rapidly Matching Histograms. VLDB 2018, 1262-1275.

Jun Yang, Pankaj K. Agarwal, Sudeepa Roy, Brett Walenz, You Wu, Cong Yu, Chengkai Li: Query Perturbation Analysis: An Adventure of Database Researchers in Fact-Checking. IEEE Data Eng. Bull. 41(3): 28-42 (2018)

Pankaj K. Agarwal, Kyle Fox, Kamesh Munagala, Abhinandan Nath, Jiangwei Pan, Erin Taylor: Subtrajectory Clustering: Models and Algorithms. PODS 2018: 75-87

A. Yu, P. K. Agarwal and J. Yang, "Top-k preferences in high dimensions," 2014 IEEE 30th International Conference on Data Engineering, Chicago, IL, 2014, pp. 748-759.

Pankaj K. Agarwal, Nirman Kumar, Stavros Sintos, Subhash Suri: Range-Max Queries on Uncertain Data. PODS 2016: 465-476.
