CS328 introduces students to the statistical and algorithmic ideas that underlie the field of data science. By the end of the course, students will be familiar with algorithms for extracting patterns from medium to large data sets. They should also be comfortable with the various statistical model-fitting techniques and understand how to reason about statistical significance. Students will also gain exposure to practical tools (e.g. numpy/scipy/matplotlib/sklearn if using Python, or analogous tools in other languages). This course will be valuable to anyone interested in further studies or work in data analytics or machine learning.
Instructor: Anirban Dasgupta, Office: AB 13/405G. Please email for an appointment.
Teaching Assistants: Jayesh Malaviya, Shubhajit Roy
Class: Offline, as per the timetable. Communication will be via Teams (link).
The grading policy is roughly as follows; it will be finalized in the first couple of lectures.
Assessment 1, 2: 25 + 30 = 55%
Homeworks: 10%
Project: 25%. It will include regular updates, presentations, and report writing.
Writing component: 10%. The goal of this component is to give students experience of the important role that communication plays in data science. Details will be shared. This will be coordinated by the writing studio.
We will strictly follow the honor policy of IITGN. Collaboration among class participants on homeworks is allowed unless explicitly stated otherwise, but everyone needs to write down their own answers and code as well as generate their own plots. Anyone with whom you discuss ideas when solving the homework needs to be mentioned clearly. You are not expected to use Google or any other source for finding answers to homework questions unless explicitly allowed.
Please join the following Teams group.
Foundations of Data Science by Avrim Blum, John Hopcroft, Ravi Kannan. Published by Cambridge University Press.
Advanced Data Analysis from an Elementary Point of View, by Cosma Shalizi.
Computational and Inferential Thinking: The Foundations of Data Science, by Ani Adhikari, John DeNero, David Wagner.
Doing Data Science, Straight Talk From The Frontline. O’Reilly. 2014. Cathy O’Neil and Rachel Schutt.
Other references will be distributed via Teams channels.
Basics of probability and distributions -- from the Mathematics for Machine Learning book.
Probability and statistics course from MIT-OCW.
Python Data Science Handbook by Jake VanderPlas.
There will be 4-5 homework assignments, spread across the semester. Homeworks will be a mix of theoretical and implementation-oriented questions.
There are roughly 40 lecture hours in the calendar. The following is a tentative order in which the topics will be covered.
Foundations
Data representation, distance measures
Central limit theorem
Random variables and tail inequalities, hashing, balls and bins.
Practical example of hashing -- MinHash and its applications
Clustering and low-rank approximations
k-means, k-center, Lloyd's algorithm, k-means++
Clustering in graphs -- expansion, conductance, modularity, k-core
Spectral algorithms for expansion and conductance.
Louvain algorithm for modularity.
Latent variable models -- Gaussian mixture models
Low-rank/low-dimensional embeddings and their uses -- SVD, other matrix factorizations, t-SNE
Patterns in big data
Efficient data summaries -- Bloom filters, bit arrays.
Streaming model: samples and sketches -- reservoir sampling, counting distinct elements, heavy hitter data structures (Misra-Gries, Count-Min, Count-Sketch)
Mining frequent itemsets -- Apriori, Eclat algorithms (if time permits)
Drawing inferences from the data
Sampling, estimation, confidence intervals, bootstrapping
Hypothesis testing and its variants -- multiple hypothesis testing, Bayes factor
Linear regression and its generalizations, model evaluation, goodness-of-fit tests
Introduction to design of experiments
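To give a flavor of the implementation-oriented side of the topics above, here is a minimal sketch of one of them: a percentile bootstrap confidence interval for a sample mean. It is an illustration only, not course material or an official homework template. The function name `bootstrap_mean_ci`, its parameters, and the synthetic Gaussian data are all assumptions made for this example, and it uses only the Python standard library rather than numpy/scipy.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`.

    Hypothetical helper for illustration: resample the data with
    replacement, record the mean of each resample, and read off the
    empirical alpha/2 and 1 - alpha/2 quantiles of those means.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)  # sampling with replacement
        means.append(statistics.fmean(resample))
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Synthetic data (an assumption for the example): 200 draws from N(10, 2^2).
    data_rng = random.Random(1)
    sample = [data_rng.gauss(10.0, 2.0) for _ in range(200)]
    low, high = bootstrap_mean_ci(sample)
    print(f"sample mean = {statistics.fmean(sample):.3f}")
    print(f"95% bootstrap CI for the mean = ({low:.3f}, {high:.3f})")
```

The percentile method is the simplest bootstrap interval; other variants exist and may behave better for skewed statistics or small samples.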