CS 328

Course Description

CS 328 introduces students to the statistical and algorithmic ideas that underlie the field of data science. By the end of the course, students will be familiar with algorithms for extracting patterns from medium to large data sets. They should also be comfortable with the various statistical model-fitting techniques and understand how to argue about statistical significance. Students will also be exposed to practical tools (e.g. numpy/scipy/matplotlib/sklearn if using Python, or analogous tools in other languages). This course will be valuable to anyone interested in further study or work in data analytics or machine learning.

Logistics:

Instructor: Anirban Dasgupta, Office: AB 13/405G. Please email for an appointment.

Teaching Assistants: Shrutimoy Das, Yash Sahu.

Class: Offline, as per the timetable. Communication will be done via Teams.

The grading policy is roughly as follows; it will be finalized in the first couple of lectures.

Assessment 1, 2: 25 + 30 = 55%
Homeworks: 10%
Project: 25%. It will include regular updates, presentations, and report writing.
Writing component: 10%. The goal of this component is to give students experience of the important role that communication plays in data science. Details will be shared. This will be coordinated by the writing studio.

We will strictly follow the honor policy of IITGN. Collaboration among class participants on homeworks is allowed unless explicitly stated otherwise, but everyone needs to write their own answers and code and generate their own plots. Anyone with whom you discuss ideas when solving the homework must be mentioned clearly. You are not expected to use Google or any other source for finding answers to homework questions unless explicitly allowed.

Communication Channel

Please join the Teams group.

Textbooks and References:

Foundations of Data Science by Avrim Blum, John Hopcroft, Ravi Kannan. Published by Cambridge University Press.
- Pre-publication draft.
Advanced Data Analysis from an Elementary Point of View, by Cosma Shalizi.
Computational and Inferential Thinking: The Foundations of Data Science, by Ani Adhikari, John DeNero, David Wagner.
Doing Data Science: Straight Talk from the Frontline, by Cathy O’Neil and Rachel Schutt. O’Reilly, 2014.
Other references will be distributed via Teams channels.

Background Material

Gilbert Strang's lectures on linear algebra.
3Blue1Brown's series on linear algebra.
Basics of probability and distributions, from the Mathematics for Machine Learning book.
Probability and statistics course from MIT-OCW.
Python Data Science Handbook by Jake VanderPlas.

Homeworks

There will be 4-5 homework assignments, spread across the semester. Homeworks will be a mix of theoretical and implementation-oriented questions.

Lecture Schedule:

There are roughly 40 lecture hours in the calendar. The following is a tentative order in which the topics will be covered.

Foundations
1. Data representation, distance measures
2. Central limit theorem
3. Random variables and tail inequalities, hashing, balls and bins
4. Practical example of hashing -- MinHash and its applications
Clustering and low-rank approximations
1. k-means, k-center, Lloyd's algorithm, k-means++
2. Clustering in graphs -- expansion, conductance, modularity, k-core
3. Spectral algorithms for expansion and conductance
4. Louvain algorithm for modularity
5. Latent variable models -- Gaussian mixture models
6. Low-rank/low-dimensional embeddings and their uses -- SVD, other matrix factorizations, t-SNE
Patterns in big data
1. Efficient data summaries -- Bloom filters, bit arrays
2. Streaming model: samples and sketches -- reservoir sampling, counting distinct elements, heavy hitter data structures (Misra-Gries, Count-Min, Count-Sketch)
3. Mining frequent itemsets -- Apriori, Eclat algorithms (if time permits)
Drawing inferences from the data
1. Sampling, estimation, confidence intervals, bootstrapping
2. Hypothesis testing and its variants -- multiple hypothesis testing, Bayes factor
3. Linear regression and its generalizations, model evaluation, goodness of fit tests
4. Introduction to design of experiments
