DSAA5002

Data Mining and Knowledge Discovery in Data Science (Fall 2024)

Instructor: LI, Jia (office hours: Wed, 3:30 PM - 4:30 PM)

TA: Gao, Shihong & Zhang, Jiawen & Tong, Bing

Lecture Time: Mon 3:00 PM - 5:50 PM

Venue: E1 122

Introduction

This course covers basic algorithms in data mining and data science, including data preprocessing, classification (decision tree classifiers, support vector machines, ensemble methods), clustering (k-means, hierarchical clustering, spectral clustering), anomaly detection, graph analytics, association analysis, PageRank, dimensionality reduction, and the EM algorithm.

Announcements

Sep. 1: Hi all.

Grading

Mid-term exam: 50%

Project: 50%

References and Handouts

Reference books and courses for extra reading:

[1] Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar.

[2] Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber, and Jian Pei.

[3] Foundations of Data Science. Avrim Blum, John Hopcroft, and Ravindran Kannan.

[4] Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, and Jeff Ullman.

[5] CMPSCI 689: Machine Learning. Subhransu Maji.

Handout 1: Introduction & Data Preprocessing
Handout 2: Decision Tree
Handout 3: Linear Models (Andrew Ng's Notes)
Handout 4: Kernels & Ensembles
Handout 5: K-means
Handout 6: Hierarchical Clustering
Handout 7: Graph Theory
Handout 8: Spectral Clustering
Mid-term Exam (Nov 4)
Handout 9: Apriori
Handout 10: Anomaly Detection
Handout 11: PageRank
Handout 12: HITS and SimRank
Handout 13: PCA
Handout 14: EM
Handout 15: HMM

Exercises

Exercise list 1, Solution 1

Exercise list 2

Exercise list 3, Solution 3

Exam

See this as a demo.

Project

Each student chooses a research topic related to the course material, e.g., data preprocessing, classification, clustering, or anomaly detection. The report should follow the ACM format with a strict 6-page limit, including references and appendices; see https://kdd.org/kdd2021/calls/view/call-for-research-track-papers for reference. Here are some tips:

The report should contain at least an introduction, related work, methodology, and experiments. A theoretical derivation is not required but is encouraged.
Use concise and clear language.
Clearly state how your work differs from previous work.
If you include a theoretical derivation, check your assumptions and make sure they are not fragile.

Resources

Anomaly detection

Outlier Detection and Description (KDD'21 workshop)

Graph analytics

Tools for large graph mining: structure and diffusion (WWW'08 tutorial)

Graph learning publications

Graph based deep learning literature
