Class Journal

Tuesdays & Thursdays (data science)

Tue Sep 24. Introduction to the course.

Thu Sep 26. Gentle introduction to linear regression (slides). Notation and naming convention. Hypotheses and measuring the error. Residuals. First problem formulation. Absolute residuals and absence of a unique solution. Squared errors and SSE aka quadratic loss. The loss/SSE as a function of the parameters. Quadratic forms and uniqueness of the min-SSE hypothesis. Regression on Ashenfelter's wine dataset: sep26.Rmd and wine.csv. Goodness of fit: the total sum of squares (SST) and the R^2. Introduction to p-values.
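
A minimal Python sketch of the same pipeline (not the sep26.Rmd / wine.csv analysis from class; the toy data below is made up): fit the min-SSE line, then compute SSE, SST and R^2 from the residuals.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Design matrix with an intercept column; solve min ||X b - y||^2.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
sse = np.sum(residuals ** 2)            # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
r2 = 1 - sse / sst                      # goodness of fit

print(beta, sse, r2)
```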

Supporting material: (1) The Analytics Edge & online course, (2) "Introduction to Applied Linear Algebra", chapter on Regression

Thu Oct 3. Linear regression reloaded. Linear model with independent Gaussian noise. Rejection of null hypotheses and p-values. Example with coin tosses. Example with a dataset from the model. R script: pvalue.Rmd. (No) general relationship between R^2 and p-values (see the script again). Maximum likelihood estimators (MLE). Derivation of MLE for Bernoulli random variables (coin tosses). Derivation of MLE for a linear model with independent Gaussian noise. Connection with OLS (ordinary least squares). Slides.
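
A minimal numeric check of the Bernoulli MLE (not the pvalue.Rmd script; the simulated coin below is an assumption): the log-likelihood k log p + (n - k) log(1 - p) of k heads in n tosses is maximized at p = k/n, the estimator derived in class.

```python
import numpy as np

rng = np.random.default_rng(0)
tosses = rng.binomial(1, 0.7, size=100)     # coin with true p = 0.7
k, n = tosses.sum(), tosses.size

# Evaluate the log-likelihood on a grid of candidate values of p.
p = np.linspace(0.01, 0.99, 999)
loglik = k * np.log(p) + (n - k) * np.log(1 - p)

print("grid argmax:", p[np.argmax(loglik)])
print("closed form k/n:", k / n)
```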

Supporting material: (1) The Analytics Edge & online course, (2) notes by Ernie Croot (GeorgiaTech) on Maximum Likelihood and Linear Regression: http://people.math.gatech.edu/~ecroot/3225/maximum_likelihood.pdf

Tue Oct 8. Overfitting (slides). Toy datasets: train.csv, test.csv. Script: bv.Rmd. Quadratic regression. Polynomial regression. Degree of the polynomial model and SSE. Unisolvence theorem. Degree of the polynomial model and SSE on the test set. Explaining the training and test SSE curves. Bias and underfitting. Variance and overfitting. The bias-variance (underfitting-overfitting) tradeoff (see an overview on wikipedia). Overfitting and p-values.
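
A minimal sketch of the degree-vs-SSE experiment on synthetic data (not train.csv/test.csv): the training SSE keeps dropping as the degree grows, while the test SSE eventually rises.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = np.linspace(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_train, y_train = make_data(20)
x_test, y_test = make_data(50)

for degree in (1, 3, 9, 15):
    coefs = np.polyfit(x_train, y_train, degree)          # min-SSE polynomial
    sse_train = np.sum((np.polyval(coefs, x_train) - y_train) ** 2)
    sse_test = np.sum((np.polyval(coefs, x_test) - y_test) ** 2)
    print(degree, round(sse_train, 2), round(sse_test, 2))
```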

Thu Oct 10. Bias-variance decomposition (slides and a simple derivation). Bias-variance in R (script). OLS in vector notation. Geometric interpretation of OLS (slides). The predictors as projection on the feature space. Solution of OLS via solution of a linear system. Doing it in R (script). Multicollinearity and absence of unique minimum (script). Gradient descent (see the slides above).
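
A minimal sketch of OLS in vector notation on made-up data (not the class script): the coefficients solve the linear system X^T X b = X^T y, and the fitted values are the projection of y onto the column space of X, so the residuals are orthogonal to every column.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=30)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
y_hat = X @ beta_hat                           # projection onto col(X)

# The residual vector is orthogonal to the feature space (up to rounding).
print(np.round(X.T @ (y - y_hat), 10))
```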

Supporting material: "Introduction to Applied Linear Algebra", chapter on Regression

Tue Oct 15. Classifiers and logistic regression. Case study: the Framingham study. Creation of a dataset and identification of the independent variables. Logistic regression. Evaluation of classifiers, choice of a threshold value, ROC curves, area under the ROC curve. Notes on maximum likelihood estimation. Slides and the R-script practical (w/ data) are available in the Google group discussion.
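
A minimal sketch of the workflow on synthetic data (not the Framingham case study; assumes scikit-learn is installed): fit a logistic regression, vary the classification threshold, and compute the area under the ROC curve.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]          # estimated P(y = 1 | x)

for threshold in (0.3, 0.5, 0.7):             # threshold choice trades off errors
    preds = (probs >= threshold).astype(int)
    print(threshold, "accuracy:", (preds == y).mean())

print("AUC:", roc_auc_score(y, probs))
```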

See also the EdX course.

Thu Oct 17. Naive Bayes classification. The Bayes classifier. The Naive Bayes model, its assumption, and training. Laplace smoothing. Numerical stability. Notes on discriminative and generative techniques. Slides and the R-script practical (w/ data) are available in the Google group discussion.
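
A minimal sketch of a word-count Naive Bayes classifier (the toy documents are made up, not the class dataset): add-one (Laplace) smoothing avoids zero probabilities, and log-probabilities keep the computation numerically stable.

```python
import math
from collections import Counter, defaultdict

train = [("spam", "buy cheap pills now"),
         ("spam", "cheap pills cheap offer"),
         ("ham",  "meeting schedule for today"),
         ("ham",  "project meeting now")]

class_counts = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
vocab = set()
for label, text in train:
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    scores = {}
    for label in class_counts:
        # log prior + sum of log likelihoods with Laplace smoothing
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("cheap pills today"))
```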

See also the EdX course.

Tue Oct 22. Clustering. (slides) Defining clustering. Similarity and distance. Number of clusters. Cluster quality measure. Euclidean distance from the centers. The k-means problem. Brute-force enumeration. Greedy improvements and Lloyd's iterative algorithm. K-means in R on the iris dataset. Script: kmeans.Rmd.
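
A minimal sketch of Lloyd's iterations on toy 2-D points (not the iris analysis in kmeans.Rmd): alternate between assigning points to the nearest center and moving each center to the mean of its cluster.

```python
import numpy as np

rng = np.random.default_rng(4)
points = np.vstack([rng.normal(0, 0.5, (50, 2)),
                    rng.normal(3, 0.5, (50, 2))])
k = 2
centers = points[rng.choice(len(points), k, replace=False)]

for _ in range(10):
    # Assignment step: each point goes to its nearest center.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each center moves to the mean of its cluster.
    centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(centers)
```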

See also the EdX course, and "Introduction to Applied Linear Algebra".

Thu Oct 24. Clustering (folder with slides, code, etc.). Principal Component Analysis and visualization. K-medians and k-centers. Law of total variance. Explained variance. The elbow method. Hierarchical (agglomerative) clustering. Single linkage, complete linkage, Ward's method. Cutting the hierarchy tree. Exercises in R: iris, movies, languages. Kleinberg's impossibility theorem for clustering. Correlation clustering.
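
A minimal sketch of agglomerative clustering on made-up points (not the class exercises; assumes SciPy is installed): build the hierarchy under different linkage criteria and cut the tree into a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
points = np.vstack([rng.normal(0, 0.3, (20, 2)),
                    rng.normal(2, 0.3, (20, 2)),
                    rng.normal(5, 0.3, (20, 2))])

for method in ("single", "complete", "ward"):
    tree = linkage(points, method=method)               # agglomerative merges
    labels = fcluster(tree, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, np.bincount(labels)[1:])              # cluster sizes
```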

See also the EdX course, and "Introduction to Applied Linear Algebra".

Tue Oct 29. Kaggle competitions. Titanic: machine learning from disaster. Public kernel: https://www.kaggle.com/marcobr/my-first-titanic.

Thu Oct 31. Kaggle competitions. Titanic: machine learning from disaster. Public kernel: https://www.kaggle.com/marcobr/my-first-titanic.

Tue Nov 5 (slides). Introduction to algorithms and complexity. Problems and functions. Computability. Uncomputable functions (see the halting problem with a simple proof). Algorithms. Elementary operations. Worst-case running time. Big-oh notation.

See also Ch. 2 of "Algorithm Design".

Tue Nov 12 (slides). NP-completeness. Three problems: Independent set, Set cover, Knapsack. The class P. Why "polynomial = easy". Certificates and verifiability. The class NP. The P versus NP problem. Possible scenarios. Reductions. Reduction of INDSET to CLIQUE. Poly-time reductions. NP-complete problems. 3-SAT is NP-complete. Reduction from 3-SAT to INDSET. What makes NP-complete problems hard?

Thu Nov 14 (slides). Linear programming. Google's AdWords. Greedy doesn't work. Combinatorial optimization. Decision variables, objective function, constraints. Feasible region. Half-spaces, polytopes. Convex sets, convex functions. Local minima, global minima. Key properties. The simplex method. Klee-Minty cube. LP is in polynomial time.
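
A minimal sketch of a small LP solved numerically (not the AdWords example; assumes SciPy is installed): maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x, y >= 0. Since linprog minimizes, the objective is negated.

```python
from scipy.optimize import linprog

c = [-3, -2]                       # objective coefficients (negated to maximize)
A_ub = [[1, 1], [1, 3]]            # constraint matrix for A_ub @ x <= b_ub
b_ub = [4, 6]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)             # optimal vertex and optimal value
```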

Some notes by James Aspnes. See also the EdX course.

Tue Nov 19. Linear programming. Exercises. The three cases: infeasible, bounded, unbounded. LPs in matrix form and canonical form. Formulation of knapsack and set cover as integer programs. Non-convexity and NP-completeness of IP. Script: class.Rmd.

Some notes by James Aspnes. See also the EdX course.

Thu Nov 21. Kaggle Home Credit competition. First submission: class1.py. AUC and ranking.

Tue Nov 26. Tikhonov regularization. Coefficients of an overfitting polynomial regression model (ridge_coefs.Rmd). Scaling/normalization. Adding a penalization term. Probabilistic interpretation with the likelihood. Matrix notation and multicollinearity. Application to the Home Credit challenge (ridge.py).
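
A minimal sketch of ridge (Tikhonov) regression on made-up data (not ridge_coefs.Rmd or ridge.py): the penalized solution (X^T X + lambda I)^{-1} X^T y shrinks the coefficients and remains well behaved even with nearly collinear columns.

```python
import numpy as np

rng = np.random.default_rng(6)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=50)])   # nearly collinear
y = x1 + rng.normal(scale=0.1, size=50)

for lam in (0.0, 0.1, 10.0):
    beta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    print(lam, np.round(beta, 3))    # huge unstable coefficients when lam = 0
```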

See also "Introduction to Applied Linear Algebra", and this nice post by Brian Keng.

Thu Nov 28. MinHash. Document similarity. Bag of words. Jaccard index. Computing the Jaccard index. The MinHash scheme. Proof that MinHash gives an estimator for the Jaccard index. Complexity of MinHash. Some concise notes.
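
A minimal sketch of MinHash on two bag-of-words sets (not the class notes; Python's built-in hash stands in for the random permutations): the fraction of matching signature entries estimates the Jaccard index.

```python
import random

def jaccard(a, b):
    return len(a & b) / len(a | b)

def minhash_signature(items, seeds):
    # One minimum per seed; hash((seed, x)) plays the role of a random permutation.
    return [min(hash((seed, x)) for x in items) for seed in seeds]

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox sleeps near the lazy dog".split())

seeds = [random.randrange(2**32) for _ in range(200)]
sig_a = minhash_signature(a, seeds)
sig_b = minhash_signature(b, seeds)

estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(seeds)
print("exact:", jaccard(a, b), "estimate:", estimate)
```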

See chapter 3 of Ullman, available at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf.

Tue Dec 3. Huffman coding. Frequency-based compression and variable-length encoding. Decoding and prefix-free codes. PFC as rooted trees. Finding the optimal tree. Top-down greedy algorithm. Huffman algorithm. Proof of optimality. Slides and notes.
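
A minimal sketch of the Huffman algorithm (not the class slides or notes): repeatedly merge the two least frequent subtrees, then read each symbol's code off its root-to-leaf path.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries: (frequency, tie-breaker, tree); a tree is a symbol or a pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):              # internal node: recurse left/right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                    # leaf: record the code
            codes[tree] = prefix or "0"
        return codes
    return walk(heap[0][2], "")

print(huffman_codes("abracadabra"))   # frequent symbols get shorter codes
```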

Thu Dec 5. Burrows-Wheeler transform. Run-length encoding. The Burrows-Wheeler transform. Building the table. Inverse transform. Searching for a pattern. First-last property. Why repeats are turned into runs. Slides and an online BW script.
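
A minimal sketch of the transform and its naive inverse (not the online BW script used in class); '$' marks the end of the string.

```python
def bwt(s):
    s = s + "$"
    table = sorted(s[i:] + s[:i] for i in range(len(s)))   # sorted rotations
    return "".join(row[-1] for row in table)                # last column

def inverse_bwt(last):
    # Rebuild the table by repeatedly prepending the last column and sorting.
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    return next(row for row in table if row.endswith("$"))[:-1]

transformed = bwt("banana")
print(transformed, inverse_bwt(transformed))   # annb$aa banana
```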

See also https://www.coursera.org/lecture/algorithms-on-strings/burrows-wheeler-transform-GAA6S

Tue Dec 10. Gradient descent. Application of gradient descent to linear regression, learning rate and convergence, stochastic and batch gradient descent, gradient descent with momentum. Slides and the Python script practical are available in the Google group discussion.
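
A minimal sketch of batch gradient descent for linear regression on made-up data (not the Python practical from the Google group): repeatedly step against the gradient of the average squared error.

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

beta = np.zeros(2)
learning_rate = 0.1
for step in range(500):
    gradient = X.T @ (X @ beta - y) / len(y)   # gradient of the (halved) mean squared error
    beta -= learning_rate * gradient

print(beta)   # close to the true coefficients (2, -3)
```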

Thu Dec 12. Introduction to Computer Vision: sample research and applications of computer vision, image formation, case studies in image recognition and 3D reconstruction, filtering and convolutions, blurring and sharpening, filtering with Gaussians and derivative of Gaussians. Slides and the Python script practical are available in the Google group discussion.
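
A minimal sketch of filtering with a Gaussian kernel (not the class practical; the toy single-pixel image is made up): build a small 2-D Gaussian and slide it over the image to blur it.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()                      # normalize so the weights sum to 1

def filter2d(image, kernel):
    # Slide the kernel over the (edge-padded) image; fine for symmetric kernels.
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((9, 9))
image[4, 4] = 1.0                           # a single bright pixel
print(np.round(filter2d(image, gaussian_kernel(5, 1.0)), 3))
```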

Tue Dec 17. PCA. Approximating data with a single feature. Projections. Sum of squared residuals. From minimization to maximization. Scaling. The correlation matrix. Maximizing a quadratic form under constraints.

Thu Dec 19. PCA. Eigenvectors and eigenvalues. Positive-semidefinite matrices. Orthonormal bases of R^n from the eigenvectors of a PSD matrix. Maximizing the variance of the projections. The first eigenvector. Explained variance. Extension to k components. Link between k-PCA and k-means.
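
A minimal sketch of PCA via the eigendecomposition of the correlation matrix on made-up data (not the class script): eigenvectors give the principal directions and eigenvalues give the variance explained by each component.

```python
import numpy as np

rng = np.random.default_rng(8)
x1 = rng.normal(size=200)
data = np.column_stack([x1,
                        2 * x1 + rng.normal(scale=0.5, size=200),
                        rng.normal(size=200)])

Z = (data - data.mean(axis=0)) / data.std(axis=0)     # scaling
C = np.corrcoef(Z, rowvar=False)                      # correlation matrix (PSD)

eigenvalues, eigenvectors = np.linalg.eigh(C)         # ascending eigenvalues
order = np.argsort(eigenvalues)[::-1]
explained = eigenvalues[order] / eigenvalues.sum()
print("explained variance ratios:", np.round(explained, 3))

scores = Z @ eigenvectors[:, order[:2]]               # projection onto the first 2 PCs
print(scores.shape)
```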

Some concise notes. Python script.

Fridays (programming)

Fri Sep 27. First steps in Python. Data types (int, str, float, list, ...). Variables and assignments. Reading the input with input(). Expressions and their evaluation. Operators and data types. Precedence of operators. Conditional statements (if, else). Lists: creation, access, indexing. Loops (for ... in ...). Appending to and reversing a list. The range(n) function. A simple primality testing algorithm. Scripts: sep27.zip. Supporting material: https://docs.python.org/3/tutorial/
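
A minimal sketch in the spirit of the class (the actual sep27.zip scripts may differ): trial division by every candidate below n.

```python
n = int(input("Enter a positive integer: "))

is_prime = n > 1
for d in range(2, n):
    if n % d == 0:          # found a divisor, so n is not prime
        is_prime = False

if is_prime:
    print(n, "is prime")
else:
    print(n, "is not prime")
```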

Fri Oct 4. Functions. Function arguments and return values. A faster prime checker. Finding the longest run of consecutive characters in a string. Conditional expressions. Function arguments and local variables. Jupyter notebook.
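
A minimal sketch of the two exercises (the class notebook may differ): a faster prime checker that only tests divisors up to sqrt(n), and the longest run of consecutive equal characters in a string.

```python
def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:        # only test divisors up to sqrt(n)
        if n % d == 0:
            return False
        d += 1
    return True

def longest_run(s):
    best = current = 1 if s else 0
    for i in range(1, len(s)):
        current = current + 1 if s[i] == s[i - 1] else 1
        best = max(best, current)
    return best

print(is_prime(97), longest_run("aabbbbcc"))   # True 4
```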

Fri Oct 11. Writing a linear algebra library. More on lists. List indexing and slicing. Nested lists. List comprehension. Conditional list comprehension. Nested list comprehension. Modules. List of exercises. Class Python script and notebook. See the relevant sections of the Python Tutorial.
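
A minimal sketch of list-comprehension linear algebra on nested lists (the class notebook and exercises may differ): matrix-vector and matrix-matrix products.

```python
def mat_vec(A, x):
    # One comprehension per row: the dot product of the row with x.
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def mat_mat(A, B):
    # Nested comprehension: entry (i, j) is row i of A dotted with column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2], [3, 4]]
print(mat_vec(A, [1, 1]))      # [3, 7]
print(mat_mat(A, A))           # [[7, 10], [15, 22]]
```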

Fri Nov 8. Opening, reading, and writing files. Counting letters. Dictionaries. Scripts here.
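
A minimal sketch of letter counting with a dictionary (the filename below is a placeholder; the class scripts may differ):

```python
counts = {}
with open("some_text_file.txt") as f:     # hypothetical input file
    for line in f:
        for ch in line.lower():
            if ch.isalpha():
                counts[ch] = counts.get(ch, 0) + 1

for letter in sorted(counts):
    print(letter, counts[letter])
```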

Fri Nov 15. Classes and objects. The constructor. Attributes: data attributes and methods. Writing a counter class. Source: http://collabedit.com/her8g.
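
A minimal sketch of a counter class (the collabedit source may differ): a constructor, a data attribute, and two methods.

```python
class Counter:
    def __init__(self, start=0):
        self.value = start          # data attribute

    def increment(self, amount=1):  # method
        self.value += amount

    def reset(self):
        self.value = 0

c = Counter()
c.increment()
c.increment(5)
print(c.value)   # 6
```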

Fri Nov 22. Hierarchical clustering. Computing the distance matrix. Creating a base class and a derived class. Skeleton of the module: libcluster.py.
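
A minimal sketch of the module structure (the real libcluster.py may differ): a base class that computes the distance matrix, and a derived class that supplies the linkage rule.

```python
import math

class HierarchicalClustering:
    def __init__(self, points):
        self.points = points
        self.dist = [[math.dist(p, q) for q in points] for p in points]

    def linkage(self, cluster_a, cluster_b):
        raise NotImplementedError    # provided by subclasses

class SingleLinkage(HierarchicalClustering):
    def linkage(self, cluster_a, cluster_b):
        # Distance between clusters = minimum pairwise distance.
        return min(self.dist[i][j] for i in cluster_a for j in cluster_b)

hc = SingleLinkage([(0, 0), (0, 1), (5, 5)])
print(hc.linkage([0, 1], [2]))   # clusters given as lists of point indices
```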

Fri Nov 29. Recursive functions. Factorial. Fibonacci. Combinatorial explosion. Memoization using dictionaries. Recursive versions of max, palindrome, and reverse. The mergesort algorithm.
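
A minimal sketch of two of the examples (the class scripts may differ): Fibonacci with dictionary memoization, and mergesort.

```python
memo = {}
def fib(n):
    if n <= 1:
        return n
    if n not in memo:                 # reuse previously computed values
        memo[n] = fib(n - 1) + fib(n - 2)
    return memo[n]

def mergesort(lst):
    if len(lst) <= 1:
        return lst
    mid = len(lst) // 2
    left, right = mergesort(lst[:mid]), mergesort(lst[mid:])
    merged = []
    while left and right:             # merge the two sorted halves
        merged.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
    return merged + left + right

print(fib(50), mergesort([5, 2, 9, 1, 7]))
```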

Fri Dec 6. Naive Bayes. Scripts and data here.