Data Mining and Knowledge Discovery in Data Science (Fall 2024)
Instructor: LI, Jia (office hours: Wed, 3:30 PM - 4:30 PM)
TA: Gao, Shihong & Zhang, Jiawen & Tong, Bing
Lecture Time: Mon 3:00 PM - 5:50 PM
Venue: E1 122
Introduction
This course covers basic algorithms in data mining and data science, including data preprocessing, classification (decision tree classifiers, support vector machines, ensembles), clustering (k-means, hierarchical clustering, spectral clustering), anomaly detection, graph analytics, association analysis, PageRank, dimensionality reduction, and the EM algorithm.
Announcements
Sep. 1: Hi all.
Grading
Mid-term exam: 50%
Project: 50%
Reference and Handouts
Reference books and courses for extra reading:
[1] Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar.
[2] Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber, and Jian Pei.
[3] Foundations of Data Science. Avrim Blum, John Hopcroft, and Ravindran Kannan.
[4] Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman.
[5] CMPSCI 689: Machine Learning. Subhransu Maji.
Handout 1: Introduction & Data Preprocessing
Handout 2: Decision Tree
Handout 3: Linear Models; Andrew Ng's Notes
Handout 5: K-means
Handout 6: Hierarchical Clustering
Handout 7: Graph Theory
Handout 8: Spectral Clustering
Mid-term Exam (Nov 4)
Handout 9: Apriori
Handout 10: Anomaly Detection
Handout 11: PageRank
Handout 12: HITS and SimRank
Handout 13: PCA
Handout 14: EM
Handout 15: HMM
Exercises
Exam
See this as a demo.
Project
Each student chooses a research topic related to the course material, e.g., data preprocessing, classification, clustering, or anomaly detection. The report should follow the ACM format with a strict 6-page limit, including references and appendix; see https://kdd.org/kdd2021/calls/view/call-for-research-track-papers for reference. Here are some tips:
The report should include at least an introduction, related work, methodology, and experiments. A theoretical derivation is not required but is encouraged.
Use concise and clear language.
Clearly state how your work differs from previous work.
If you include a theoretical derivation, check your assumptions and make sure they are robust.
Resources
Anomaly detection
Outlier Detection and Description (KDD'21 workshop)
Graph analytics
Tools for large graph mining: structure and diffusion (WWW'08 tutorial)
Graph learning publications