What should you do when you face a huge amount of complicated data from real-life applications? This course introduces the core techniques of big data analytics, namely knowledge discovery in databases (KDD), also known as data mining (DM). It focuses on principles, fundamental algorithms, implementations, and applications.
Comprehensive understanding of, and skills in, data structures, such as linked data structures, B-trees, and hash functions.
Analysis of algorithms and time complexity.
Operating systems, main memory and disk management, file systems.
Elementary probability theory and statistics, such as random variables, distributions, probability mass functions, sampling, and statistical tests.
(Official textbook) Data Mining: Concepts and Techniques (3rd ed.), Jiawei Han, Micheline Kamber, and Jian Pei, Morgan Kaufmann, 2011.
The video lectures will be pre-recorded and posted on YouTube (in unlisted mode), with links provided on this webpage. You should watch each video on or before its specified date.
The class will meet online on Zoom every Thursday afternoon, 3:30-4:20 pm. We will discuss assignments and projects, run quizzes, and hold office hours at that time.
We will use Piazza.
Getting to know your data [slides, video: part 1, part 2, Chapter 2, May 21]
Data preprocessing [slides, video: part 1, part 2, Chapter 3, May 28]
Business intelligence and data warehousing [slides, video: part 1, part 2 (Chapter 4), part 3, part 4 (Chapter 5), June 11]
Finding useful patterns and rules [slides, June 30]
Classification [slides, July 22]
Clustering analysis [slides, August 6]
Assignment 1, due at 11:59 pm May 31, 2020 (Sunday), covering introduction, getting to know your data, and data preprocessing. To access this assignment, you need the password that was distributed through email to all enrolled students.
Assignment 2, due at 11:59 pm June 21, 2020 (Sunday), covering data warehousing and OLAP. To access this assignment, you need the password that was distributed through email to all enrolled students.
Assignment 3, due at 11:59 pm July 10, 2020 (Friday), covering pattern mining. To access this assignment, you need the password that was distributed through email to all enrolled students.
Assignment 4, due at 11:59 pm July 24, 2020 (Friday), covering classification. To access this assignment, you need the password that was distributed through email to all enrolled students.
Assignment 5, due at 11:59 pm August 10, 2020 (Monday), covering clustering. To access this assignment, you need the password that was distributed through email to all enrolled students.
Course project, due at 11:59 pm August 10, 2020 (Monday). A description of a series of data sets that may be used in the project is provided.