Machine Learning

Machine Learning Lecture Series (M.Sc. in Engineering Mathematics, Batch 2020 - 2022)

Lecture I (2 Hours, 12th July 2021) (Topic: Introduction to Bayesian Statistics)

1) Introduction to Bayesian inference and its difference from classical inference. Under the Bayesian setup, probability describes a degree of belief, not the limiting frequency of repeated experiments.

2) Prior distribution, posterior distribution, and likelihood function. The posterior is proportional to the product of the likelihood and the prior.

3) Analytical computation of the posterior distribution; the normalizing constant can often be ignored (but take caution!). There are situations where explicit computation of the normalizing constant is required.

4) Bayesian point estimates: posterior mean and posterior mode; posterior interval. Interpretation of the posterior interval versus the confidence interval, and how they differ with respect to the long-run frequency definition.

5) The posterior mean is a convex combination of the prior mean and the sample mean. As the sample size n grows large, the posterior mean is approximately equal to the sample mean. The posterior density is a conditional density, conditional on the data: it is the distribution of the parameter over the parameter space after the prior is updated using Bayes' theorem.
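As a minimal R sketch of point 5 (my own illustrative setup, not necessarily the one used in class: a Bernoulli(p) sample with a Beta(a, b) prior, so that the posterior mean (sum(x) + a)/(n + a + b) is a weighted average of the sample mean and the prior mean a/(a + b)):

set.seed(1)
a <- 2; b <- 3                           # hypothetical prior Beta(a, b); prior mean = a/(a + b)
n <- 50
x <- rbinom(n, size = 1, prob = 0.7)     # hypothetical Bernoulli(0.7) sample
post_mean <- (sum(x) + a)/(n + a + b)    # posterior mean of p
w <- n/(n + a + b)                       # weight on the sample mean; w -> 1 as n grows
w*mean(x) + (1 - w)*a/(a + b)            # the same number, written as a convex combination
post_mean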

Link to the Video: Click Here

Link to the Writing Notes: Click Here

Link to the study material: Click Here (Several worked-out examples are given, and a comparison of Bayes estimators is also discussed)


Lecture II (2 Hours, 15th July 2021) (Topic: Introduction to Bayesian Statistics)

1) Properties of the posterior Bayes estimator. We observed that posterior Bayes estimators are (except in degenerate cases) always biased: if a Bayes estimator turns out to be unbiased, then its Bayes risk must be equal to zero (proof done in class), which means it equals the true parameter with probability one.

2) We computed posterior estimates of functions of the parameter as well, e.g., g(p) = p(1-p) and g(p) = log(p/(1-p)) for Bernoulli(p). Often it is analytically complicated to obtain the exact distribution; in that case simulation is a handy way to approximate the exact density function using a histogram. For simple cases, however, you can always obtain the exact posterior density of a function of the parameter via the MGF, the characteristic function, the transformation formula, or the definition itself.

3) Posterior Bayes estimators turn out to depend on the data through a statistic (a function of the sample), and in every example we observe that this is in fact a sufficient statistic (more precisely, a minimal sufficient statistic). The connection is deep; intuitively, however, the idea is simple: the Bayes estimator also carries the complete information about the parameter from the sample in the most condensed possible form.

4) Flat priors; proper and improper priors. Even if the prior is improper, we can continue the investigation under the Bayesian setup as long as the posterior is proper. So, for an improper prior you must always check whether the posterior is proper; if the prior is proper, the posterior is always proper.

5) Concept of a conjugate prior. The idea is simple: for a given likelihood, the prior and the posterior belong to the same family of density functions. You must work out all the exercises to make yourself comfortable.

Exercise 1: Show that the inverted gamma family is the conjugate family for σ² in the normal distribution with mean μ and variance σ². Do the explicit computation as shown in the class. (Solution)

Exercise 2: Simulate observation from the posterior distribution of g(p) (given in (2)). You can use rbeta() function from R software for statistical computing.
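A possible starting point for Exercise 2 (a sketch only, assuming a Bernoulli sample with a Beta(a, b) prior so that p | data ~ Beta(a + sum(x), b + n - sum(x)); the numbers below are hypothetical):

set.seed(2)
n <- 30; x <- rbinom(n, 1, 0.4)                       # hypothetical data
a <- 0.5; b <- 0.5                                    # prior hyperparameters
p_post <- rbeta(10000, a + sum(x), b + n - sum(x))    # draws from the posterior of p
par(mfrow = c(1, 2))
hist(p_post*(1 - p_post), probability = TRUE, col = "grey", main = "g(p) = p(1 - p)")
hist(log(p_post/(1 - p_post)), probability = TRUE, col = "grey", main = "g(p) = log(p/(1 - p))")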


Link to the Video: Click Here

Link to the Writing Notes: Click Here

Link to the Text Books: (Chapter 11, Section 11.4, All of Statistics by Larry Wasserman) (Chapter VII, Section 7, p. 340, Introduction to the Theory of Statistics by Mood, Graybill and Boes)

Lecture III (2 Hours, 19th July 2021) (Topic: Jeffreys Prior and Fisher Information)

1) Flat priors are not transformation invariant. If we do not have any information about the parameter, then we have no information about a function of the parameter either. We observe that if f(p) = 1 for 0 < p < 1 (a flat prior), then the density function of g(p) need not be flat.

2) The idea is that a uniform distribution on the parameter space does not induce a uniform distribution on the model space. The Jeffreys prior is a prior on the parameter space that induces a uniform prior on the model space: f(θ) is proportional to the square root of I(θ), the expected Fisher information.

3) Computation of the Jeffreys prior. We observe that, most of the time, the Jeffreys prior is improper.

4) Large sample properties of Bayes estimator and Bayesian Delta method.

5) Computation of joint posterior distribution in a multi-parameter setup. Computation of marginal posterior density function.

6) Simulation from the posterior density of a function of the parameters under a multi-parameter setup. Posterior inference by simulation: computing the posterior mean and posterior interval using R.
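A minimal sketch of point 6 in R, under an assumption of my own: a normal sample with both mean and variance unknown and the standard noninformative prior p(μ, σ²) ∝ 1/σ², so that σ² | data follows a scaled inverse chi-square and μ | σ², data is normal; the posterior of any function of (μ, σ²) is then obtained by simulation:

set.seed(3)
y <- rnorm(40, mean = 5, sd = 2)                   # hypothetical data
n <- length(y); ybar <- mean(y); s2 <- var(y)
S <- 10000                                         # number of posterior draws
sigma2 <- (n - 1)*s2/rchisq(S, df = n - 1)         # sigma^2 | data
mu <- rnorm(S, mean = ybar, sd = sqrt(sigma2/n))   # mu | sigma^2, data
g <- mu/sqrt(sigma2)                               # a function of the parameters, e.g. mu/sigma
mean(g)                                            # posterior mean
quantile(g, c(0.025, 0.975))                       # 95% posterior interval
hist(g, probability = TRUE, col = "grey")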


Link to the Video: Click Here

Link to the Writing Notes: Click Here

Link to the R Codes: Click Here

Link to the Text Books: (Chapter 11, All of Statistics by Larry Wasserman, Section 11.5 and 11.7)

Link to Extra Notes on Jeffreys Prior: Click Here (This study was done by Urbi Dutta, M.Sc. Engineering Mathematics 2018-2020)


Lecture IV (2 Hours, 22nd July 2021) (Topic: Posterior Estimation and Bayesian CLT)

1) Estimation of Posterior Mode, Posterior Mean, and Posterior Median. Finding the MAP estimator is an optimization problem, whereas the posterior mean estimator is an integration problem.

2) If the prior is flat, then the MAP estimator coincides with the MLE. General-purpose algorithms for MAP and posterior mean estimation are discussed.

3) The Bayesian CLT and its verification using R. Note that the Bayesian CLT applies only when the sample size is large; it should not be applied for small or even moderate sample sizes. (A small verification sketch is given after this list.)

4) Revision of the multivariate Newton-Raphson method; everyone is asked to connect the Hessian matrix with the covariance matrix of the MLE.
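A minimal R sketch of the verification mentioned in point 3 (my own example: a Bernoulli sample with a flat prior, so the exact Beta posterior can be overlaid with its normal approximation centred at the posterior mode with variance p_hat*(1 - p_hat)/n):

set.seed(4)
n <- 200
xdat <- rbinom(n, 1, 0.3)                    # hypothetical Bernoulli data
s <- sum(xdat)
p_hat <- s/n                                 # posterior mode under a flat prior
curve(dbeta(x, s + 1, n - s + 1), 0, 1, lwd = 2, col = "red", ylab = "density",
      main = "Exact posterior vs normal approximation")
curve(dnorm(x, mean = p_hat, sd = sqrt(p_hat*(1 - p_hat)/n)), add = TRUE, lwd = 2, lty = 2, col = "blue")
legend("topright", c("Beta posterior", "normal approximation"), col = c("red", "blue"), lwd = 2, lty = c(1, 2))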


Link to the Video: Click Here

Link to the Writing Notes: Click Here

Link to the R Codes: Click Here


Lecture V (1 Hour, 26th July 2021) (Topic: Posterior Estimation: Laplace Approximation and Gibbs Sampler)

1) Computation of posterior estimates using the Laplace approximation. Under some regularity conditions, integrals with respect to the posterior density are approximated by replacing the integrand with a suitable Gaussian.

2) Simulation from the joint posterior density function using the Gibbs sampler. The idea is to simulate from the univariate full conditional densities and thereby generate a Markov chain over all the parameters. After the burn-in period, the values in each individual chain behave as realizations from the corresponding marginal posterior density, and, taken together across components, they act as realizations from the joint posterior density function.

3) For implementation, refer to Applied Statistics Lecture V. Using the conditional distributions, we simulated realizations from the bivariate normal distribution. It is a very useful algorithm and is part of many machine learning techniques.
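A minimal Gibbs-sampler sketch for the bivariate normal example in point 3 (standard normal marginals with correlation rho; the specific numbers are my own). It alternates between the full conditionals X | Y = y ~ N(rho*y, 1 - rho^2) and Y | X = x ~ N(rho*x, 1 - rho^2):

set.seed(5)
rho <- 0.8
S <- 5000; burn <- 500
x <- y <- numeric(S)
x[1] <- 0; y[1] <- 0                          # arbitrary starting values
for (t in 2:S) {
  x[t] <- rnorm(1, mean = rho*y[t - 1], sd = sqrt(1 - rho^2))  # X | Y
  y[t] <- rnorm(1, mean = rho*x[t],     sd = sqrt(1 - rho^2))  # Y | X
}
keep <- (burn + 1):S                          # discard the burn-in period
plot(x[keep], y[keep], col = "red", pch = 20, cex = 0.5)
cor(x[keep], y[keep])                         # should be close to rho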

Exercise 1: If we have a random sample of size n from the bin(1, p) distribution and consider the Jeffreys prior for p, that is the beta(0.5, 0.5) density function, then the posterior density of p|y is beta(y+0.5, n-y+0.5), where y = sum(x) = number of successes. The posterior expectation is E(p|y) = (y+0.5)/(n+1). Approximate the posterior expectation E(p|y) using the Laplace approximation and obtain the approximation error explicitly.

Exercise 2: Let X~Poisson(eλ) with known e, and assume that λ ~ Gamma(α,β). (a) Compute the posterior expectation of λ. (b) Compute the Laplace approximation of this posterior expectation. (c) For α = 0.5 and β = 0, compare the Laplace approximation with the exact value, given the observations x = 11, e = 3.04, or x = 110 and e = 30.4. Also, compute the relative error of the Laplace approximation. (d) Consider θ = log(λ). Find the posterior density function (exact) for θ (transformation formula). Compute the Laplace approximation of the posterior expectation of θ. Compare this with the exact value that you have obtained by using numerical integration using R/Python. (You can use integrate() function in R).


Link to the Video: Click Here

Link to the Writing Notes: Click Here

Link to the R Codes: Teaching -> Applied Statistics -> Lecture V


Lecture VI (2 Hours, 29th July 2021) (Topic: Bayes Rule and Minimax Rule)

1) Introduction to the loss function as a discrepancy between the parameter and its estimator, and examples of different loss functions: squared error loss; absolute error loss; Lp loss; zero-one loss; Kullback-Leibler loss.

2) Computation of the expected loss, or the risk function, as a function of the parameter. Comparing risk functions and the need for a one-number summary of the risk to compare estimators.

3) Definition of Maximum Risk (supremum over all possible values of parameter) and Bayes Risk (expected risk under prior density) and their motivation in use. A comparative study with examples has been discussed.

4) Definition of the Bayes rule and the minimax rule. The Bayes rule minimizes the Bayes risk, whereas the minimax rule minimizes the maximum risk.

5) We defined the term posterior risk (expected loss under the posterior density) and found an integral identity that connects it with the Bayes risk. We derived that the Bayes rule minimizes the posterior risk as well, and used this as an algorithm to obtain the Bayes rule under a given loss function.


Link to the Video: Click Here

Link to the Writing Notes: Click Here


Lecture VII (2 Hours, 2nd August 2021) (Topic: Empirical Bayes)

1) If the parameters of the prior distribution (hyperparameters) are not known, they may be estimated from the marginal distribution of the data. In that case, the hyperparameters are replaced by these point estimates and the Bayesian analysis continues as usual.

2) The uncertainty associated with learning the hyperparameters is not carried into the subsequent analysis; this is a drawback of empirical Bayes.

3) In hierarchical Bayes, by contrast, the unknown hyperparameter is itself given a distribution (a hyperprior). Hence the uncertainty in the hyperparameters is included in the final estimation of the posterior mean. This is an advantage of hierarchical Bayes over empirical Bayes.


Link to the Video: Click Here

Link to the Writing Notes: Click Here


Lecture VIII (2 Hours, 2nd August 2021) (Topic: Simple Linear Regression and Geometric Ideas)

1) We have observations on two random variables (X, Y), with Y as the response and X as the predictor. The regression function r(x) is the conditional expectation of Y given X = x. All the discussion in the lecture is essentially about the conditional distribution of Y given X.

2) The aim is to approximate the regression function by a parametric function. When r(x) = β0 + β1*x, we have simple linear regression, written as Yi = β0 + β1*Xi + εi, where εi is the unexplained error component with mean zero and variance σ². Essentially there are three parameters (β0, β1, and σ²) that need to be estimated from the data.

3) Least-squares estimation of the parameters; we showed that the estimators are unbiased and computed their variance-covariance matrix.

4) Geometrical considerations: projection of the random vector Y onto the linear space spanned by the predictors and onto its orthogonal complement. The necessary geometric ideas and the splitting of the degrees of freedom are discussed.


Link to the Video: Click Here

Link to the Writing Notes: Click Here

Link to R Handouts: Click Here (Program for the sum of squares minimization using R, general-purpose optimizers in R also discussed)


Lecture IX (2 Hours, 9th August 2021) (Topic: Multiple Linear Regression and Real data analysis, z-score and p-value, Inference)

1) Concept of multiple linear regression as an extension of simple linear regression; obtaining the estimates and the uncertainty associated with them. With two predictors the estimated model is a plane, and with p predictors it is a hyperplane.

2) Estimating the coefficients in software using the least-squares method. Equivalence of least-squares and maximum likelihood estimation under the normality assumption on the errors.

3) Inference about the features using z-scores and p-values. The sampling distribution of the estimators is well approximated by a normal distribution for large samples; hence the Wald confidence interval is also discussed.
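A minimal sketch of point 3 on simulated data (the data-generating model below is hypothetical); the z-scores and Wald intervals are computed by hand and can be compared with summary() and confint():

set.seed(6)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2*x1 - 0.5*x2 + rnorm(n, 0, 1)      # hypothetical population model
fit <- lm(y ~ x1 + x2)
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))                 # standard errors
z   <- est/se                                # z-scores (the t-values in summary(fit))
p   <- 2*pnorm(-abs(z))                      # large-sample p-values
cbind(est, se, z, p)
cbind(est - 1.96*se, est + 1.96*se)          # Wald 95% confidence intervals
confint(fit)                                 # exact t-based intervals, for comparison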


Link to the Video: Click Here

Link to the Writing Notes: Click Here

Link to R Handouts: Click Here (Fitting multiple linear regression using R and using direct formula) (Link to Prostate Cancer Data)

Link to Python Handouts: Click Here


Lecture X (2 Hours, 12th August 2021) (Topic: Linear Regression and Nearest Neighbour Method)

1) Some examples of regression problems and classification problems. Examples of multiclass classification problems: Hand-written digit classification, Iris data.

2) Classification problem and the decision boundary (Simulated example)

3) K-Nearest Neighbour Methods for regression and classification problems. (R functions: knnreg and knn3 from caret package in R).

4) Decision-theoretic framework and minimization of the expected prediction error EPE(f), where Y = f(X) + e is the underlying true relationship between X and Y. Comparison of least squares and nearest-neighbour methods in approximating the function f.
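A minimal sketch of the comparison in point 4 on simulated data (the nonlinear f below is my own choice; knnreg is from the caret package mentioned in point 3, which is assumed to be installed):

library(caret)                                # for knnreg()
set.seed(7)
n <- 200
x <- runif(n, 0, 3)
y <- sin(2*x) + rnorm(n, 0, 0.3)              # hypothetical truth f(x) = sin(2x)
train <- data.frame(x = x, y = y)
grid  <- data.frame(x = seq(0, 3, length.out = 300))
fit_lm  <- lm(y ~ x, data = train)
fit_knn <- knnreg(y ~ x, data = train, k = 9)
plot(x, y, col = "grey40", pch = 20)
lines(grid$x, predict(fit_lm,  grid), col = "blue", lwd = 2)   # rigid linear fit
lines(grid$x, predict(fit_knn, grid), col = "red",  lwd = 2)   # flexible k-NN fit
legend("topright", c("least squares", "9-NN"), col = c("blue", "red"), lwd = 2)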

Exercise 1: Create Markdown files in python for the data exploration of the following problems: ISLR, Version 2, Chapter 2, Problem No. 8 (College Data), Problem No. 9 (Auto Data), Problem No. 10 (Boston Housing Data) (page 54 - 57)

Exercise 2: Click on the Link to ML Handouts and understand the behavior of the decision boundary for different values of k. In the given simulation, it seems that the classes are linearly separable. Create a dataset by simulation where the classes are not linearly separable, do the same experiment, and examine the behavior of the decision boundary as a function of k.

Exercise 3: In the ML Handouts, the simulation experiment is shown for a classification problem. Create a simulation example for the regression problem and check the fit as a function of k. For the regression problem, simulate a data set using the following population regression functions for Y = f(X) + e: (a) Y = 1 + 2*x + N(0, 0.5), and (b) Y = 1 + 2*x^2 + N(0, 0.5). Try to connect these figures with the concept of the trade-off between bias and variance.

Link to the Video: Click Here

Link to the Lecture Slides: Click Here

Link to ML Handouts: Click Here (How the decision boundary changes as a function of K, for K nearest neighbors method)


Lecture XI (2 Hours, 19th August 2021) (Topic: Expected Prediction Error and Universal Behavior of Train and Test Error)

1) Expected prediction error and its decomposition into reducible and irreducible error components. The reducible component can further be broken into squared bias and variance.

2) Model selection and the bias-variance tradeoff.

3) Understanding the universal behavior of training and test error as a function of model flexibility. The simulation was carried out using R; see the ML Handouts for the code.
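A minimal simulation sketch of point 3 (my own setup: polynomial fits of increasing degree to data from a nonlinear truth; the training error decreases monotonically while the test error is U-shaped):

set.seed(8)
f <- function(x) 1 + 2*sqrt(x) - sin(x)           # hypothetical true regression function
n <- 100
x_tr <- runif(n, 0, 5);    y_tr <- f(x_tr) + rnorm(n, 0, 0.3)
x_te <- runif(1000, 0, 5); y_te <- f(x_te) + rnorm(1000, 0, 0.3)
degrees <- 1:10
train_mse <- test_mse <- numeric(length(degrees))
for (d in degrees) {
  fit <- lm(y_tr ~ poly(x_tr, d))
  train_mse[d] <- mean(residuals(fit)^2)
  test_mse[d]  <- mean((y_te - predict(fit, data.frame(x_tr = x_te)))^2)
}
plot(degrees, test_mse, type = "b", col = "red", ylim = range(c(train_mse, test_mse)),
     xlab = "model flexibility (polynomial degree)", ylab = "MSE")
lines(degrees, train_mse, type = "b", col = "blue")
legend("topright", c("test MSE", "training MSE"), col = c("red", "blue"), lwd = 1)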

Link to the Video: Click Here

Link to Writing Notes: Click Here

Link to ML Handouts: Click Here (Simulating the universal behavior of training and test error)


Lecture XII (2 Hours, 23rd August 2021) (Topic: Multicollinearity and Model selection - I)

1) Multiple Linear Regression by successive orthogonalization of the predictors. Generating orthogonal basis for the column space of the data matrix X and QR decomposition.

2) Concept of multicollinearity and the variance inflation factor (computation of R-squared included; see the sketch after this list).

3) Concept of model selection in the context of multiple linear regression; Mallows's Cp statistic.
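A minimal sketch of the variance inflation factor from point 2, computed directly from the definition VIF_j = 1/(1 - R_j^2), where R_j^2 is the R-squared from regressing the j-th predictor on the others (the correlated predictors below are simulated for illustration):

set.seed(9)
n <- 200
x1 <- rnorm(n)
x2 <- 0.9*x1 + rnorm(n, 0, 0.3)               # strongly correlated with x1
x3 <- rnorm(n)
r2_1 <- summary(lm(x1 ~ x2 + x3))$r.squared   # R^2 from regressing x1 on the other predictors
r2_2 <- summary(lm(x2 ~ x1 + x3))$r.squared
r2_3 <- summary(lm(x3 ~ x1 + x2))$r.squared
c(VIF1 = 1/(1 - r2_1), VIF2 = 1/(1 - r2_2), VIF3 = 1/(1 - r2_3))  # large values flag multicollinearity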

Link to the Video: Click Here

Link to Writing Notes: Click Here


Lecture XIII (2 Hours, 26th August 2021) (Topic: Model selection - II)

1) Unbiased Estimator of Expected Test Error. Mallows's Cp as an estimator of expected test MSE having the form: Training Error + Bias correction.

2) Derivation of the Akaike Information Criterion (AIC) as goodness of fit minus model complexity. Similarly, BIC is a model selection criterion derived as goodness of fit minus model complexity. Both AIC and BIC penalize the likelihood evaluated at the MLE, but BIC imposes a heavier penalty (for large n) and tends to select a simpler model than AIC.

3) If we consider a uniform prior on the model space, then the model with the largest BIC is the same model that has the highest posterior probability on the model space given the data.

4) Mallows's Cp, AIC, and BIC apply various bias corrections to the training error in order to estimate the expected test error. However, there are also direct methods to estimate the expected test error; we discussed the validation-set approach and K-fold cross-validation, along with leave-one-out cross-validation (LOOCV).

5) If we instead use the definition AIC = -2*log-likelihood + 2*|S|, then we have to choose the model with the minimum AIC. This is the form commonly implemented in software packages, where the model with the smallest AIC is selected; so you need to be careful, and you are advised to check the package's manual. In addition, under the least-squares method (with σ known, as in Exercise 1 below), AIC and Mallows's Cp have exactly the same expression up to irrelevant constants.
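A minimal check of the sign convention in point 5 using R's built-in AIC() for a linear model (note that R counts the error variance as an estimated parameter, so its |S| here is the number of regression coefficients plus one):

set.seed(10)
x <- runif(100); y <- 1 + 2*x + rnorm(100, 0, 0.3)
fit <- lm(y ~ x)
ll  <- logLik(fit)                        # log-likelihood at the MLE
k   <- attr(ll, "df")                     # number of estimated parameters (here 3: b_0, b_1, sigma)
AIC(fit)                                  # R's convention: smaller is better
-2*as.numeric(ll) + 2*k                   # identical, computed by hand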

Exercise 1: Assume a linear regression model with normal errors. Consider σ to be known. Show that the model with the highest AIC (log-likelihood - |S|) is the model with the lowest Mallows's Cp statistic.

Exercise 2: Consider the prostate cancer data and fit all possible models formed from the various combinations of predictors. For each model, compute AIC, BIC, and Mallows's Cp, and find the best model.

Exercise 3: Consider a model y = f(x) + e, with f(x) = 0.1 + 2*sqrt(x) - sin(x) and sigma = 0.3 (the population relation). Simulate a data set of size n = 100. Fit polynomials of varying degree (1:10) and in each case compute the AIC. Check whether AIC selects the same model as K-fold cross-validation.

Link to the Video: Click Here

Link to Writing Notes: Click Here

Link to Short Note on AIC: Click Here


Lecture XIV (2 Hours, 7th September 2021) (Topic: Variable Selection and L2 Regularization)

1) Variable selection: best subset selection, forward selection, and backward selection. Hybrid selection, a combination of forward and backward selection, also exists but was not discussed in class.

2) Ridge regression: interpretation of the complexity parameter and geometrical ideas via constrained optimization. The ridge solution of the least-squares estimation problem and the idea of incorporating a little bias to reduce the variance substantially.

3) Computing the coordinates that experience the maximum shrinkage. Singular value decomposition and its geometrical interpretation in explaining shrinkage.
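A minimal sketch of point 3: writing X = U D V^T (SVD), the ridge fit shrinks the coordinate of y along u_j by the factor d_j^2/(d_j^2 + lambda), so directions with small singular values are shrunk the most. The data and the value of lambda below are my own choices:

set.seed(11)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n*p), n, p) %*% chol(0.5 + 0.5*diag(p)), scale = FALSE)  # centred, correlated columns
beta <- c(2, -1, 0, 1, 0.5)
y <- drop(X %*% beta + rnorm(n))
lambda <- 10
sv <- svd(X)
shrink <- sv$d^2/(sv$d^2 + lambda)            # shrinkage factor for each principal direction
round(cbind(singular_value = sv$d, shrinkage = shrink), 3)
beta_ridge <- sv$v %*% ((sv$d/(sv$d^2 + lambda)) * crossprod(sv$u, y))  # ridge coefficients via the SVD
beta_ls    <- sv$v %*% ((1/sv$d) * crossprod(sv$u, y))                  # least-squares, for comparison
round(cbind(beta_ls, beta_ridge), 3)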

Exercise 1: Assume the columns of the centered data matrix are orthonormal; obtain the relationship between the least-squares estimates and the ridge estimates of the regression coefficients.

Link to the Video: Click Here

Link to Writing Notes: Click Here

Link to Tutorial: Click Here

Link to the textbook: ISLR Version 2, Chapter 6 (Model selection and regularization) Section 6.1 and Section 6.2.1.


Lecture XV (2 Hours, 13th September 2021) (Topic: Lab Session Using R)

1) Best subset selection, Forward Selection, and Backward Selection

2) Ridge regression, Lasso, and Elastic Net.
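A minimal lab sketch for point 2 using the glmnet package (assumed installed): alpha = 0 gives ridge, alpha = 1 gives the lasso, and intermediate alpha gives the elastic net, with lambda chosen by cross-validation:

library(glmnet)
set.seed(12)
n <- 100; p <- 10
X <- matrix(rnorm(n*p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))            # sparse true coefficients
y <- drop(X %*% beta + rnorm(n))
fit_ridge <- cv.glmnet(X, y, alpha = 0)         # ridge, 10-fold CV for lambda
fit_lasso <- cv.glmnet(X, y, alpha = 1)         # lasso
fit_enet  <- cv.glmnet(X, y, alpha = 0.5)       # elastic net
round(cbind(ridge = drop(as.matrix(coef(fit_ridge, s = "lambda.min"))),
            lasso = drop(as.matrix(coef(fit_lasso, s = "lambda.min"))),
            enet  = drop(as.matrix(coef(fit_enet,  s = "lambda.min")))), 3)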

Link to the Video: Click Here

Link to R Codes: Click Here

Link to Python Codes: Click Here (Thanks to Khushboo Agarwal)

Link to the textbook: ISLR Version 2, Chapter 6 (Linear Models and regularization) Section 6.5 and Section 6.2.1.


Lecture XVI (2 Hours, 20th September 2021) (Topic: Linear Methods of Classification: General Overview)

1) Linear Regression approach to a classification problem, decision boundary, Indicator Response Matrix, Connection to multiple linear regression, making predictions.

2) If K (the number of classes) is large, then, due to the rigid nature of the regression fits, some classes can be masked by others: their fitted scores never dominate. In addition, the predicted probabilities P(G=k|X=x) can be negative or greater than one, especially if we make predictions outside the hull of the training data.

3) Linear Discriminant Analysis models the posterior probability P(G=k|X=x) when the class-specific densities and the prior probability of each class are given. Each class-specific density is taken to be multivariate normal with a common covariance matrix across classes, whereas in Quadratic Discriminant Analysis different classes may have different covariance matrices.
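A minimal sketch of point 3 using the MASS package (distributed with R) on the iris data mentioned in Lecture X; this only illustrates the fitting functions, not necessarily the in-class example:

library(MASS)
fit_lda <- lda(Species ~ ., data = iris)   # common covariance matrix across classes
fit_qda <- qda(Species ~ ., data = iris)   # class-specific covariance matrices
pred_lda <- predict(fit_lda, iris)$class
pred_qda <- predict(fit_qda, iris)$class
table(iris$Species, pred_lda)              # training confusion matrix for LDA
mean(pred_lda == iris$Species)             # training accuracy, LDA
mean(pred_qda == iris$Species)             # training accuracy, QDA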

Exercise 1: In LDA and QDA, how many parameters need to be estimated? Explicitly count and write the numbers. The number of inputs is p with K class labels.

Link to the Video: Click Here

Link to Writing Notes: Click Here

Link to the textbook: ESL, Chapter 4 (Linear Methods for Classification).


Lecture XVII (2 Hours, 23rd September 2021) (Topic: Linear Methods of Classification: Computation)

1) Connection with the ridge and lasso ideas in Regularized Discriminant Analysis: the covariance matrix is regularized using a convex combination of the class-specific covariance matrix and the pooled covariance matrix. The regularization parameter can be chosen using cross-validation.

2) Eigen decomposition of the covariance matrix and computational considerations of LDA and QDA.

3) Estimation of parameters for the logistic regression model, maximum likelihood estimation. Multivariate Newton Raphson method.
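A minimal Newton-Raphson (iteratively reweighted least squares) sketch for point 3, for a logistic regression with one predictor and simulated data; the update is beta <- beta + (X'WX)^(-1) X'(y - p):

set.seed(13)
n <- 500
x <- rnorm(n)
p_true <- 1/(1 + exp(-(-0.5 + 1.5*x)))
y <- rbinom(n, 1, p_true)
X <- cbind(1, x)                               # design matrix with intercept
beta <- c(0, 0)                                # starting value
for (it in 1:25) {
  eta <- drop(X %*% beta)
  p <- 1/(1 + exp(-eta))                       # fitted probabilities
  W <- p*(1 - p)                               # weights
  score <- crossprod(X, y - p)                 # gradient of the log-likelihood
  hess  <- crossprod(X, W*X)                   # X'WX (negative Hessian)
  step  <- solve(hess, score)
  beta  <- beta + drop(step)
  if (max(abs(step)) < 1e-10) break            # convergence
}
beta
coef(glm(y ~ x, family = binomial))            # agrees with glm()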

Exercise 1: Show that the maximum likelihood estimator of the parameters of the logistic regression model using the Newton Raphson method is consistent. That means, as the sample size increases, the estimates converge to the true value.

Link to the Video: Click Here

Link to Writing Notes: Click Here

Link to the textbook: ESL, Chapter 4 (Linear Methods for Classification).


Lecture XVIII (2 Hours, 28th September 2021) (Topic: Multivariate Normal distribution)

1) For making an analytical comparison of the LDA, QDA, Logistic Regression, and Naive Bayes, understanding the Multivariate Normal distribution is very important. Hence, a couple of lectures are devoted to understanding the properties of multivariate normal distribution.

2) Gentle Introduction to special matrices: Properties of Positive definite matrix, Square Root Matrix, Random vectors, and Random matrices, Partitioning of the covariance matrix, sample variance, Cauchy Schwarz inequality and its generalization, maximization of quadratic forms

3) Multivariate Normal distribution and constant density contours

Exercise 1: Write a program to compute the square root matrix of a positive definite matrix. Write it on your own.
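One possible solution sketch for Exercise 1, using the spectral decomposition A = P Λ P^T so that A^(1/2) = P Λ^(1/2) P^T (try your own version first):

mat_sqrt <- function(A) {
  e <- eigen(A, symmetric = TRUE)              # spectral decomposition
  if (any(e$values <= 0)) stop("matrix is not positive definite")
  e$vectors %*% (sqrt(e$values) * t(e$vectors))  # P diag(sqrt(lambda)) P^T
}
A <- matrix(c(4, 2, 2, 3), 2, 2)               # a positive definite matrix
B <- mat_sqrt(A)
B %*% B                                        # recovers A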

Link to the Video: Click Here

Link to Writing Notes: Click Here

Link to the textbook: Johnson and Wichern (1998) Applied Multivariate Statistical Analysis. 4th Edition, Prentice-Hall, New York.


Lecture XVIII (2 Hours, 30th September 2021) (Topic: Geometry of Multivariate Data)

1) Geometry of multivariate data in Euclidean space. The sample mean vector, the sample variance-covariance matrix, and the unbiased estimator of the population variance-covariance matrix.

2) Concept of Generalized Sample Variance and associated geometrical ideas.

3) Connection between |R| (the determinant of the sample correlation matrix) and |S| (the determinant of the sample covariance matrix).


Link to the Video: Click Here

Link to Writing Notes: Click Here

Link to the textbook: Johnson and Wichern (1998) Applied Multivariate Statistical Analysis. 4th Edition, Prentice-Hall, New York.


Lecture XIX (2 Hours, 4th October 2021) (Topic: Sampling from Normal Distribution)

1) Results related to multivariate normal distribution

2) Maximum Likelihood estimation of the population mean and population covariance matrix when sampling from the multivariate normal distribution

3) Wishart distribution
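A minimal sketch of point 2 above (the MLE of the mean is the sample mean vector and the MLE of the covariance matrix is the sample covariance rescaled by (n-1)/n); the data are simulated with mvrnorm from the MASS package:

library(MASS)                                  # for mvrnorm()
set.seed(14)
mu <- c(1, 2); Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)
X <- mvrnorm(500, mu, Sigma)                   # 500 draws from N_2(mu, Sigma)
n <- nrow(X)
mu_hat    <- colMeans(X)                       # MLE of the mean vector
Sigma_hat <- (n - 1)/n * cov(X)                # MLE of the covariance matrix
mu_hat
Sigma_hat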


Link to the Video: Click Here

Link to Writing Notes: Click Here

Tutorial on Multivariate Normal distribution: Click Here

Link to the textbook: Johnson and Wichern (1998) Applied Multivariate Statistical Analysis. 4th Edition, Prentice-Hall, New York.


Lecture XX (2 Hours, 11th October 2021) (Topic: Principal Component Analysis)

1) Computation of the principal components

2) Principal component Regression
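A minimal sketch of both points using prcomp on standardized predictors, followed by a principal component regression on the first two scores (simulated data, my own choices):

set.seed(15)
n <- 200
x1 <- rnorm(n); x2 <- 0.8*x1 + rnorm(n, 0, 0.4); x3 <- rnorm(n)
y  <- 2 + x1 + x2 + rnorm(n)
X  <- cbind(x1, x2, x3)
pca <- prcomp(X, center = TRUE, scale. = TRUE)  # principal components
summary(pca)                                    # proportion of variance explained
Z <- pca$x[, 1:2]                               # scores on the first two components
fit_pcr <- lm(y ~ Z)                            # principal component regression
summary(fit_pcr)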


Link to the Video: Click Here

Link to Writing Notes: Click Here

Link to the textbook: Johnson and Wichern (1998) Applied Multivariate Statistical Analysis. 4th Edition, Prentice-Hall, New York.


Lecture XXI (2 Hours, 18th October 2021) (Topic: Analytical comparison of classification methods)

1) Naive Bayes algorithm

2) Analytical Comparison of logistic regression, LDA, QDA, and Naive Bayes

3) Implementation using R

Exercise 1: State the assumptions for Linear Discriminant Analysis (LDA) and show that LDA models the log odds of the posterior probabilities as a linear function of x.

Exercise 2: State the assumptions for Quadratic Discriminant Analysis (QDA) and show that QDA models the log odds of the posterior probabilities as a quadratic function of x.

Exercise 3: State the assumptions for the Naive Bayes algorithm and show that Naive Bayes models the log odds of the posterior probabilities as a function of x having a generalized additive model framework.


Link to the Video: Click Here

Link to Writing Notes: Click Here

Link to ML Handouts: Click Here

Link to the textbook: ISLR, 2nd Edition, 2019.


Lecture XXII (2 Hours, 25th October 2021) (Topic: Nonparametric Curve Estimation)

1) Histogram Estimator, choice of optimal bin width

2) Kernel Density Estimator.

par(mfrow = c(2,2))   # 2 x 2 grid of panels, one per kernel

x = rnorm(10)         # a small sample from N(0,1)

plot(density(x, kernel = "gaussian"), lwd = 2)       # Gaussian kernel

plot(density(x, kernel = "epanechnikov"), lwd = 2)   # Epanechnikov kernel

plot(density(x, kernel = "rectangular"), lwd = 2)    # rectangular (box) kernel

plot(density(x, kernel = "triangular"), lwd = 2)     # triangular kernel


Link to the Video: Click Here

Link to Writing Notes: Click Here

Link to the textbook: All of Statistics by Larry Wasserman.


Lecture XXIII (2 Hours, 25th October 2021) (Topic: Orthogonal functions and Nonparametric Curve Estimation)

1) Orthogonal and Orthonormal functions, Legendre polynomials, Inner product and norm

2) Estimation of density function f from L2(0,1)

3) Risk function and choice of optimal smoothing parameter (number of terms to be kept in the series expansion using cross-validation)

Exercise 1: Write a program to select the optimal choice of J (number of terms to keep in the series expansion) using cross-validation. Keep the density function and orthogonal function of your own choice.


Link to the Video: Click Here

Link to Writing Notes: Click Here

Link to the textbook: All of Statistics by Larry Wasserman.


Lecture XXIV (2 Hours, 25th November 2021) (Topic: Tree-Based Method)

1) Regression and Classification Tree

2) Recursive partitioning
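A minimal sketch of a regression tree grown by recursive partitioning, using the rpart package (distributed with R) on simulated data with a piecewise-constant truth:

library(rpart)
set.seed(16)
n <- 300
x1 <- runif(n); x2 <- runif(n)
y  <- ifelse(x1 < 0.5, 2, 5) + ifelse(x2 < 0.3, -1, 1) + rnorm(n, 0, 0.5)
dat <- data.frame(y, x1, x2)
fit <- rpart(y ~ x1 + x2, data = dat)          # recursive binary partitioning
print(fit)                                     # the fitted splits
plot(fit); text(fit, use.n = TRUE)             # tree diagram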


Link to the Video: Click Here

Link to Slide: Click Here

Link to the textbook: ISLR, Version 2.


Lecture XXV (2 Hours, 30th November 2021) (Topic: Projection Pursuit Regression)

1) Projection Pursuit Regression

2) Comparison with Generalized Additive Model

3) Structure of the neural network
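A minimal sketch of projection pursuit regression (point 1) using the built-in stats::ppr function, on simulated data whose signal depends on a single linear projection of the inputs:

set.seed(17)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- sin(x1 + 0.5*x2) + rnorm(n, 0, 0.2)      # signal through one projection of (x1, x2)
dat <- data.frame(y, x1, x2)
fit <- ppr(y ~ x1 + x2, data = dat, nterms = 1)  # one ridge function
summary(fit)                                   # estimated direction and ridge function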


Link to the Video: Click Here

Link to Writing Notes: Click Here

Link to the textbook: ISLR, Version 2.


Lecture (1 Hour, 23rd November 2021) (Topic: Haar Wavelet and Function Approximation)

1) Father and Mother Haar Wavelets, Rescaling and shifting of Mother Haar Wavelet

2) Orthonormal basis of L2(0,1) and functional approximation using Haar basis

3) Doppler function and approximation by Haar basis functions. Computation of the coefficients using the inner product.

4) Simulation Study using doppler function as the population regression function with normally distributed errors.

5) Implementation of hard thresholding and estimation of coefficients and approximation of f using J resolution of Haar basis.

6) Exercises are given based on density function using Haar basis.
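A minimal sketch of points 1-3: the father and mother Haar wavelets, the rescaled/shifted mother wavelets, and a resolution-J approximation of the Doppler function, with coefficients computed by numerical inner products on a grid (the Doppler function is taken in its standard form sqrt(x(1-x))*sin(2.1*pi/(x+0.05))):

phi <- function(x) as.numeric(x >= 0 & x < 1)                                     # father wavelet
psi <- function(x) as.numeric(x >= 0 & x < 0.5) - as.numeric(x >= 0.5 & x < 1)    # mother wavelet
psi_jk <- function(x, j, k) 2^(j/2) * psi(2^j * x - k)                            # rescaled and shifted
doppler <- function(x) sqrt(x*(1 - x)) * sin(2.1*pi/(x + 0.05))
grid <- seq(0, 1, length.out = 2^12)
fx <- doppler(grid)
J <- 6                                                        # resolution level
approx_f <- mean(fx) * phi(grid)                              # alpha = <f, phi> by numerical integration
for (j in 0:(J - 1)) {
  for (k in 0:(2^j - 1)) {
    beta_jk <- mean(fx * psi_jk(grid, j, k))                  # <f, psi_jk>
    approx_f <- approx_f + beta_jk * psi_jk(grid, j, k)
  }
}
plot(grid, fx, type = "l", col = "grey50", ylab = "f(x)")
lines(grid, approx_f, col = "red", lwd = 1.5)                 # Haar approximation at resolution J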


Link to the Video: Click Here

Link to R Codes: Click Here

Link to the textbook: Chapter 22, Smoothing using Orthogonal Functions, Section 22.4 in All of Statistics by Larry Wasserman.



Lectures on Statistical Learning Using Simulation

  • Review of concepts related to simple linear regression and multiple linear regression. Emphasis will be on regression diagnostics. Potential problems in applying linear regression: non-linearity of the response-predictor relationships, correlation of error terms, non-constant variance of error terms, outliers, high-leverage points, collinearity among the predictors. (6 hours. This is a recall of earlier concepts; by doing these assignments students are expected to get comfortable with R or Python.)

  • Review of Probability Distribution (Mainly from a Simulation point of view) (Study Material from Applied Statistics - I course, MAT 2301) (2 hours, discussion only)

  • Introduction to Statistical Learning, Mean Square Error (MSE), Training Error, Test Error, Bias-variance trade-off, Measuring the quality of fit, Understanding the concept of model flexibility and prediction accuracy, Universal behavior of Training and Test MSE. Case study of linear regression with K-nearest neighbor regression (4 hours) (Study Material: Chapter 2 and 3 of ISLR)

  • Overview of Resampling Methods: Validation set approach, Leave-One-Out Cross-Validation, K-fold cross-validation, Bootstrap estimation, Jackknife estimation. Their utility and specific properties. (4 hours) (Study Material: Chapter 5, ISLR)

  • Model Selection and Regularization: Best subset selection, Forward Selection, Backward selection, Hybrid selection, shrinkage methods: Ridge regression, Lasso, Least angle regression (6 hours) (Chapter 6: ISLR)

  • Further Statistical Concepts-I: Convergence in Probability and Convergence in Distribution (with implementation in R), Fisher Information and comparison of estimators, Delta method with worked examples, Second-order delta method. (4 hours) (Study Material: Chapter 5, Statistical Inference by Casella and Berger; Chapter 5, All of Statistics by Larry Wasserman). All the concepts must be supported by R or Python code.

  • Further Statistical Concepts-II: Exponential family, Maximum Likelihood Estimation, Method of Moments, Best Unbiased Estimators (4 hours) (Study Material: Chapter 7: Section 2, Statistical Inference by Casella and Berger)

  • Further Statistical Concepts-III: Principles of Bayesian statistical methods, Bayes estimation, Empirical Bayes, Hierarchical Bayes. Creation of simulated case studies. Comparing the Bayes estimator with the maximum likelihood estimator for the parameters of some common distribution functions (6 hours) (Study Material: Chapter 7, Statistical Inference by Casella and Berger)

  • Multivariate Normal Distribution and Its properties, Sampling from a multivariate normal distribution, Sampling distribution of Sample Mean and Sample Variance-Covariance Matrix. (4 hours) (Study Material: Chapter 4, Applied Multivariate Analysis by Johnson and Wichern)

  • Modelling with Categorical Response: Classification problem, Logistic Regression, Support Vector Machines, Multivariate Newton Raphson method, Receiver operating characteristic (ROC) curves, Area under the curve (AUC), and other related accuracy measures (4 hours)

  • Moving to Nonlinear Models: Polynomial regression, Step functions, basis functions, Regression splines, Smoothing splines, Local Regression, Generalized additive models. (8 hours) (Study material: Chapter 7: ISLR)

  • Unsupervised learning, Multivariate methods, Principal Component Analysis, Factor Analysis, Principal component regression, K-means clustering, Hierarchical Clustering, Multi-dimensional scaling, Canonical correlation analysis (6 hours) (Study Material: Chapter 8, 9, 10, 12 from Applied Multivariate Statistical Analysis and Chapter 10 from ISLR)


The references:

  1. Introduction to Statistical Learning with Application in R by James, G., Witten, D., Hastie, T. and Tibshirani, R.

  2. Statistical Inference by George Casella and Roger L. Berger.

  3. All of Statistics: A concise course on Statistical Inference by Larry Wasserman

  4. Applied Multivariate Statistical Analysis by Richard A. Johnson and Dean W. Wichern

Much of the emphasis will be on Bayesian statistics. Statistical simulation will play a pivotal role in all these studies. One has to be comfortable with R or Python.

Prerequisite course: Applied Statistics-I, Applied Statistics-II.

Assessment Process:

The following will be the process of assessment of the Machine learning course for 20 marks (continuous assessment)

  • All the assignments (updated on the web) should be carried out. Those who worked on Python during the summer will create the case studies using Python; those who worked on R during the summer will submit the assignments using R.

  • There will be two evaluations (10 marks each): one before the mid-semester examination and another before the end-semester examination.

  • Assignment reports (printed copy) (3 marks) + Personal Interview (3 marks) + Presentation (4 marks).

  • Mid-semester examination and end-semester examination will be based on the theories taught in the class.


Machine Learning: Lecture - I (July 16, 2018) (01:30 PM - 03:30 PM)

The first three lectures will emphasize the understanding of regression concepts. Usually students understand the concepts related to regression theoretically and derive several results, like confidence intervals for the parameters. However, there are several checks that have to be carried out before one finalizes the model. Here we will look into these matters more closely by artificially creating different scenarios and understanding the problems associated with them. Suppose, for example, we ask why the confidence intervals for the regression coefficients are based on the normal distribution.

x = seq(1,10, length.out = 20) # input variable

print(x) # printing the contents of x

n = length(x) # number of rows in the data set

y = rep(NA, n) # space for response variable

Specification of the population parameters for a simple linear regression model. Note that there are three parameters: b_0 (intercept), b_1 (slope) and sigma (standard deviation of the errors).

b_0 = 0.5

b_1 = 1.4

sigma = 0.3

While specifying the population parameters, people often ask how I got an idea about their values, and note that once the estimation process is done, the estimates may not match the exact values. Caution! The population parameters are never known. When we receive the data, whether it was generated from a linear regression situation or not is also not known to us. Here y = b_0 + b_1*x + N(0, sigma^2) is the population regression line. Based on sample data we estimate the parameters of the population regression line. Since we do not have real data in hand, we create artificial data and perform the regression analysis on it.

One has to understand the critical role of simulation in understanding statistical concepts. We do not know how nature has played its role; we can only speculate about it by collecting sample data and drawing inference about our hypothesis regarding the functioning of nature. Whether our methodology leads us to the true value or not, we do not know; in fact there is no way to validate it on real data. The only way to check whether our method works reasonably well is to do it by simulation. We simulate data by fixing the population parameters, then apply our method to estimate these parameters and check whether the estimates get close to them. We check the validity of the method by means of a large number of simulations. If it works nicely for almost all simulated scenarios, we are confident that our method will work satisfactorily for a sample drawn from a population whose parameters are not known. Let us generate some artificial sample data to perform regression analysis. R is an excellent platform for simulation. I shall mainly do things in R, but students are free to use either R or Python according to their personal choice. I shall avoid using specific functions from custom packages and try to hard-code every concept for understanding. With help from the students, there will be another web page named "ML Python" which will contain all the codes that are executed here.

y = b_0 + b_1 * x + rnorm(n, 0, sigma)

print(y)

plot(x, y, col = "red")

# Consider three situations

par(mfrow=c(2,2))

y = b_0 + b_1*x # fixed line

plot(x,y, col = "red", main = expression(paste(sigma," = ", 0)))

y = b_0 + b_1*x + rnorm(n, 0, 1) # small sigma

plot(x,y, col = "red" , main = expression(paste(sigma," = ", 1)))

y = b_0 + b_1*x + rnorm(n, 0, 2) # moderate sigma

plot(x,y, col = "red", main = expression(paste(sigma," = ", 2)))

y = b_0 + b_1*x + rnorm(n, 0, 5) # large sigma

plot(x,y, col = "red" , main = expression(paste(sigma," = ", 5)))

If sigma is very high, then it may not even be possible to identify which model to start the analysis of the data with. Let us come back to the original situation: we have simulated an artificial data set of n rows.

par(mfrow=c(1,1))

sigma = 1

x = seq(1, 10, length.out = 20)

n = length(x)

y = b_0 + b_1 * x+ rnorm(n, 0, sigma)

plot(x,y, col = "red", lwd=2)

Given the scatter plot, we decide which model to start with. After looking into the scatter plot, you make an assumption about the population. You are planning to talk about something which is never known, or is impossible to know in a lifetime. Still, we draw inference about the population based ONLY on that assumption. So whether or NOT we started with a correct assumption must be checked once the model has been developed.

par(mfrow=c(1,1))

plot(x, y, col = "red", lwd=2)

fit = lm(y ~ x)

summary(fit)

coef(fit)

print(c(b_0, b_1))

abline(fit, col = 1, lwd=2)

Note that we get an estimate of the population parameters based on the sample data, usually called the training data. But, since we have not talked about training and test sets yet, I will skip the use of the term training set in the first few lectures. Later, once we start with the concept of statistical learning, these concepts will be made clear. Now, given a single estimate of beta_0 and beta_1 (usually denoted beta_0_hat and beta_1_hat), we cannot visualize the sampling distribution of these estimators. Note that the estimators are random variables as they are based on the sample data: if the experiment (or the survey) were carried out by another experimenter (surveyor), these estimates would invariably change. In the theory lecture, we showed that beta_hat (the vector of estimators) follows a (multivariate) normal distribution with mean equal to the true population parameters and a certain variance-covariance matrix. The confidence intervals are constructed based on this normality property.

Now suppose that one does not know this information, or the person did not attend the Applied Statistics-II lecture :). So how can one guess the distribution of beta_hat? Recall our simulation lecture. We can get an idea about the sampling distribution of beta_hat by repeating the process a large number of times. At every replication we simulate the response, estimate the coefficients, and record them. If we do this 1000 times, then we will have 1000 values of beta_0_hat and beta_1_hat. By drawing the histogram of these values, we can guess the sampling distribution of beta_hat. In the following we repeat the process 500 times.

rep = 500 # replicate the process 500 times

est_b_0 = numeric(rep)

est_b_1 = numeric(rep)

for(i in 1:rep){

y = b_0 + b_1 * x + rnorm(n, 0, sigma)

points(x, y, col = i+1)

fit = lm(y~x)

abline(fit, col = i+1, lwd =2)

est_b_0[i] = coef(fit)[1]

est_b_1[i] = coef(fit)[2]

}

est_b_0

par(mfrow=c(1,2))

hist(est_b_0, probability = TRUE, col = "grey", main = expression(beta[0]), ylim =c(0, 0.8))

points(mean(est_b_0), 0, col = "blue", lwd=2, cex=3, pch = "*")

points(b_0, 0, col = "red", lwd=2, cex=3, pch = "*")

curve(dnorm(x, mean =b_0, sd = sd(est_b_0)), add = TRUE, col = "red", lwd=2)

hist(est_b_1, col = "grey", probability = TRUE, main = expression(beta[1]))

points(mean(est_b_1), 0, col = "blue", lwd=2, cex=3, pch = "*")

points(b_1, 0, col = "red", lwd=2, cex=3, pch = "*")

curve(dnorm(x, mean =b_1, sd = sd(est_b_1)), add = TRUE, col = "red", lwd=2)

There is no surprise in the resulting histograms, which are closely approximated by a normal density function. In addition, note that the average of the beta estimates matches the true values (the values used to simulate the data). The curves verify the theoretical claims we made in the proof of the sampling distribution of beta_hat.

Another important observation is that the average of all the beta_hat values is very close to the true value: the red and blue stars match almost perfectly. This illustrates that beta_hat is an unbiased estimator of beta. In reality we do not have this luxury; we have to make decisions based on a single sample of size n only. Since the sampling distribution is normal, if we construct a confidence interval based on a single sample of size n, we will be correct most of the time. This is guaranteed by the concentration of the normal distribution around its mean: if the linear regression assumptions are correct, beta_hat will rarely deviate from its mean by more than 3 standard deviations. You may not be happy with the histogram and the overlaid normal density function. If we increase the number of replications, the histogram gets closer and closer to the normal density. The following code easily demonstrates this fact (using a uniform sample as a simple illustration).

par(mfrow=c(1,1))

n = 20000

x = runif(n, 0, 1)

hist(x, probability = TRUE, col = "grey")

abline(h=1, col = "red", lwd=2)

So from this graph, you should be more convinced that, with more replications, the histogram settles down to the limiting density; the same happens for the sampling distribution of beta_hat, which is indeed normal :). Now, as we discussed, the assumptions play the most critical role in deciding whether the model is useful or not. Let us check whether the assumptions are correct. We will follow the same trick: create simple cases artificially and see what the problem is. This approach is like reading the theory in a mathematics book before taking up the exercises. Usually in statistics it is suggested that starting with real data gives the best idea. I agree, but this is an M.Sc. in Mathematics course, not a data science course for industry practitioners. Of course, understanding real data will be given emphasis, but at the same time the theory behind it must also be understood. The choice of R is just to better visualize/realize/understand the facts. When working with machine learning concepts, flexibility with programming gives an additional advantage; hence R has been included as the programming choice.

Now let us check whether the assumptions are valid; in other words, given the scatter plot, whether linear regression is a legitimate option to start analyzing the data with.

x = seq(1, 10, length.out = 20)

n = length(x)

sigma = 1

y = b_0 + b_1*x + rnorm(n, 0, sigma)

plot(x,y, lwd=2, col = "red")

fit = lm(y~x)

abline(fit, lwd=2, col = "blue")

par(mfrow=c(2,2))

plot(fit)

residuals(fit)

hist(residuals(fit), probability = TRUE, col = "grey")

curve(dnorm(x, mean =0, sd = sigma),add=TRUE, col = "red")

(Homework - I) Here I shall give you a homework. Read the details about the quantile-quantile plot: how it is constructed and how it helps to identify the distribution. Demonstrate its utility using a positively skewed, a negatively skewed, and a symmetric distribution. Also check a distribution with thicker tails. Create small case studies with simulated data sets. Understand the functions rnorm, qnorm, pnorm, dnorm related to the normal distribution. Remember that these four letters (r, q, p, d) are very important letters in R. The following code may be used as a starting point.

par(mfrow=c(1,1))

curve(dnorm(x, 0, 1), -6, 6, col = "red", lwd=2)

curve(dcauchy(x, 0, 1), add = TRUE, col = "blue", lwd=2)

legend("topright", c("Normal", "Cauchy"), lwd=c(2,2), col = c("red","blue") )

Homogeneity of variance: for x = 1 the standard deviation of Y has to be sigma, and similarly for the other x values. So at each x value, Y has a normal distribution with mean b_0 + b_1*x and standard deviation sigma. If the variance of Y changes across different choices of x, then the situation is called heteroscedasticity.

x = 1:10

n = length(x)

sigma = 1

rep = 100

b_0 = 1

b_1 = 1.5

for(i in 1:rep){

y = b_0 + b_1 * x + rnorm(n, 0, sigma)

if(i == 1){

plot(x, y, col = i, ylim = c(0,20), main = "homogeneity of variance")

}

else{

points(x, y , col =i)

}

}

curve(b_0 + b_1 *x, col = "red", lwd=2, add = TRUE)

In the following we simulate the response in such a way that as x increases the variation in y also increases. It can be intuitively understood that in such a scenario the residual versus fitted values plots will show a funnel shaped scatter plot.

x = 1:10

n = length(x)

sigma = 1

rep = 100

b_0 = 1

b_1 = 1.5

for(i in 1:rep){

y = b_0 + b_1 * x + rnorm(n, 0, sigma* sqrt(x))

if(i == 1){

plot(x, y , col = i, ylim=c(0,20), main = "variance non-homogeneity")

}

else{

points(x, y , col =i)

}

}

curve(b_0 + b_1 *x, col = "red", lwd=2, add = TRUE)

The following piece of code explains how we can compute the sample quantiles. Note the use of the functions rnorm, order, and qnorm.

x = rnorm(100)

order_x = x[order(x)]

n = 1:length(x)

prob_vals = (n-1/2)/length(x)

q = qnorm(prob_vals, mean = 0, sd=1) # Exact normal quantiles

plot(prob_vals, order_x, col = "red")

points(prob_vals, q, col = "blue")

Machine Learning: Lecture - II (July 18, 2018)

In the previous lecture, we finished with the concept of the quantile-quantile plot. To elaborate further, suppose we plot the distribution function of the standard normal random variable. The following plots describe the same normal distribution in different ways. By comparing the theoretical and sample quantiles we can check whether the data have been generated from a specified distribution or not.

par(mfrow=c(2,2))

curve(pnorm(x), -5, 5, col = "red", lwd=2, main = "Distribution Function")

curve(dnorm(x), -5, 5, col = "red", lwd=2, main = "Density Function")

curve(qnorm(x), 0, 1, col = "red", lwd=2, main = "Quantile Function")

x = rnorm(100)

hist(x, probability = TRUE, col = "lightgrey", main = "Histogram of x")

par(mfrow=c(1,1))

curve(pnorm(x), -5, 5, col = "red", lwd=2, main ="N(0,1)")

points(0, 0, lwd=2,col ="blue", pch ="*", cex= 3)

abline(v=0, lwd=2, col = "magenta")

abline(h = 0.5, lwd=2, col = "magenta")

points(0, 0.5, lwd=2, col = "black", cex=2, pch = 20)


# Quantile function

curve(qnorm(x), 0, 1, col = "red", lwd=2, main ="N(0,1)", xlab = "q")

points(seq(0, 1, length.out = 10), qnorm(seq(0, 1, length.out = 10)), cex =2, col = "black", pch = 20, lwd=2)

# Comparing with sample quantile

seq_p = seq(0, 1, length.out = 200)

qnorm(0.5)

pnorm(0)

seq_q = qnorm(seq_p)

plot(seq_p, seq_q, col = "red")

x = rnorm(100)

print(x)

order_x = x[order(x)] # look help(order)

print(order_x)

n = length(x)

seq_q_val = (1:n-1/2)/n # Continuity Correction

seq_q_val

plot(seq_q_val, order_x, col = "red", pch = "*")

legend("topleft", c("sample quantile"), col = "red", lwd=1, pch = "*")

points(seq_q_val, qnorm(seq_q_val), col = "blue", lwd=1)

Empirical distribution function. If the random variable is not continuous, then the quantile function needs to be defined more carefully: it is defined as inf{x : F(x) ≥ q}. The ecdf function stands for the empirical cumulative distribution function.

x = rpois(100, lambda = 2)

plot(ecdf(x), col = "red")

Let us come back to the original problem of regression. We were trying to investigate different diagnostic tests for regression.

x = seq(1, 10, length.out = 30)

sigma = 1

b_0 = 0.5

b_1 = 2

y = b_0 + b_1 * x + rnorm(length(x), 0, sigma)

plot(x, y, col = "red", lwd=2)

fit = lm(y ~ x)

abline(fit, col = "blue", lwd=2)

par(mfrow=c(2,2))

plot(fit)

In the following we will talk about a nonlinear relationship between the response and the predictor.

par(mfrow=c(1,1))

x = seq(0, 1, length.out = 40)

sigma = 0.1

y = b_0 + b_1*x^2 + rnorm(length(x), 0, sigma)

plot(x, y, col = "red", lwd=2)

Suppose we decided to go with a linear regression model. In the following we investigate the linear regression model.

fit = lm(y ~ x)

abline(fit, lwd =2, col = "blue")

par(mfrow=c(2,2))

plot(fit)

The above code generates four figures. These figures play a crucial role in determining whether the choice of a linear regression model is reasonable. Essentially these plots (also called diagnostic plots) help examine the validity of the linear regression model assumptions. From the residuals versus fitted values graph we can see the possibility of a nonlinear relationship between the input and the response.

par(mfrow =c(1,1))

y_hat = coef(fit)[1] + coef(fit)[2]*x

plot(y_hat, residuals(fit), col = "blue", lwd=2)

abline(h=0, lwd=2)

One has to be cautious before removing outliers. It might happen that the data in hand are positively skewed and a particular value is very high. In that case, transformations are often found to be useful.

par(mfrow=c(1,1))

b_0 = 1

b_1 =0.5

sigma = 2

x = seq(1, 5, length.out = 30)

y = b_0 + b_1*exp(x) + rnorm(length(x), 0, sigma)

plot(x, y, col = "red", lwd=2)

fit = lm(y ~ x)

abline(fit, lwd=2, col = "blue")

par(mfrow=c(2,2))

plot(fit)

Some values are identified as outliers, but removing them is not a good idea. In addition, looking at the residual plots, one may even accept linear regression as a reasonable choice, because the red line is close to zero throughout the range of fitted values. Let us consider another example in the following:

par(mfrow=c(1,1))

x = seq(0, 2, length.out = 60)

sigma = 0.2

y = b_0 + b_1*sin(pi*x) + rnorm(length(x), 0, sigma)

plot(x,y, col = "red")

curve(b_0 + b_1*sin(pi*x),add= TRUE, lwd=2,col = "blue")

fit = lm(y ~ x)

abline(fit, col = "magenta", lwd=2)

par(mfrow=c(2,2))

plot(fit)

par(mfrow=c(1,1))

y_hat = coef(fit)[1] + coef(fit)[2]*x

plot(y_hat, residuals(fit), col = "blue", lwd=2)

abline(h=0, lwd=2)

The curved pattern indicates the possibility of a nonlinear relationship between the response and the predictor. Now the question is how to rectify it. Transformation often turns out to be important for dealing with such a situation. Recall the concept of linearizable regression models: many nonlinear regression models can be converted to a linear regression model by means of transformation. For example, if the true relationship between the response and the predictor is given by y = b_0 + b_1*x^2 + error, then the model can be converted to a linear one by taking the transformations Y = y and X = x^2; the model can then be fitted and coefficient estimates obtained by means of the linear regression Y = b_0 + b_1*X + error.

par(mfrow=c(1,1))

x = seq(0, 1, length.out = 40)

sigma = 0.1

y = b_0 + b_1*x^2 + rnorm(length(x), 0, sigma)

plot(x, y, col = "red", lwd=2)

Suppose we decided to go with a linear regression model. So what would be the problem? Notice that over some ranges of x-values the fit is systematically biased.

fit = lm(y ~ x)

abline(fit, lwd =2, col = "blue")

The regression model is said to be linear if it is linear with respect to the parameters. Hence, although the relationship between x and y is nonlinear, the problem can still be converted to a linear regression problem.

x_2 = x^2 # The transformed variables

fit = lm(y ~ x_2)

y_hat = coef(fit)[1] + coef(fit)[2]*x_2

plot(y_hat, residuals(fit), col = "red")

abline(h =0, lty=2, lwd=2, col = "blue")

Heteroscedasticity: Another important assumption of the linear regression model is that the error terms have constant variance. This is a very crucial point and must be checked, as the hypothesis tests about the parameters and the associated confidence intervals are based on this assumption. Non-constant variance of the error terms can be identified from the residuals versus fitted values plot: if we observe a funnel-shaped behavior of the residuals, then heteroscedasticity is present. Often a transformation of the response Y using a concave function (like sqrt(Y) or log(Y)) helps to stabilize the variance.

x = seq(1, 10, length.out = 200)

b_0 = -1

b_1 = 2

sigma = 1.4

y = b_0 + b_1 * x + rnorm(length(x), 0, sigma*sqrt(x))

par(mfrow=c(1,1))

plot(x, y, col = "red", lwd=2)

fit = lm(y ~ x)

abline(fit, lwd=2, col = "blue")

par(mfrow=c(2,2))

plot(fit)

par(mfrow=c(1,1))

y_hat = coef(fit)[1] + coef(fit)[2]*x

plot(y_hat, residuals(fit), col = "red", lwd=2)

abline(h = 0, col = "blue", lwd=2)

Funnel-shaped residuals with respect to the fitted values indicate heteroscedasticity. First, consider what problems it may cause. A simple simulation plan would be to consider two cases. Suppose we fix the population parameter values of beta_0, beta_1, and sigma. In one case, approximate the sampling distribution of beta_hat where the variance of the errors is constant. Do the same for another case where the error variance is not constant across the different x_i values. Now compare the histograms. Use par(mfrow=c(2,2)) to plot the four histograms, two for the first case study and the other two for the second case study. Draw confidence intervals on the histograms by vertical lines. Try to identify the differences.

Problems with outliers:

x = seq(0, 1, length.out = 50)

b_0 = 0.3

b_1 = 0.7

sigma = 0.1

y = b_0 + b_1*x + rnorm(length(x), 0, sigma)

plot(x,y, col = "red", lwd=2)

length(y)

y[45] = 4

plot(x,y, lwd=2, col = "red")

fit_1 = lm(y~x)

abline(fit_1, col = "blue", lwd=2)

new_y = y[-45]

new_x = x[-45]

fit_2 = lm(new_y~ new_x)

abline(fit_2, lwd=2, col ="magenta", lty=1)

In the residual plot, the number 45 indicates the row number of the data set that has been identified as an outlier. By construction, we made observation 45 an outlier. R is very user-friendly in these aspects. Now let us compare the situations with and without the outlier by graphical inspection.

resid_1 = residuals(fit_1)

resid_2 = residuals(fit_2)

y_hat_1 = coef(fit_1)[1] + coef(fit_1)[2]*x

y_hat_2 = coef(fit_2)[1] + coef(fit_2)[2]*new_x

par(mfrow=c(1,2))

plot(y_hat_1, resid_1, col = "red", lwd=2, main = "With outlier")

abline(h = 0, lwd=2, lty=2)

plot(y_hat_2, resid_2, col = "red", lwd=2, main = "without outlier")

abline(h = 0, lwd=2, lty=2)

hist(resid_1/sd(resid_1), probability = TRUE, main = "With outlier", col = "grey")

hist(resid_2/sd(resid_2), probability = TRUE, main = "Without outlier", col = "grey")

par(mfrow=c(2,2))

plot(fit_1)

plot(fit_2)

Leverage statistic: Outliers are observations for which the response y is unusual given the value of x_i. Observations with high leverage are those with unusual values of x_i. Points which are both high-leverage and outliers are usually a dangerous combination; one has to be very cautious about them.

par(mfrow=c(1,1))

x = rep(NA, 101)

x[1:100] = seq(0, 1, length.out = 100)

x[101] = 2

y= b_0 + b_1 *x + rnorm(length(x), 0, sigma)

plot(x,y, col = "red", lwd=2)

fit = lm(y ~ x)

abline(fit, lwd=2, col = "blue")

abline(lm(y[-101]~x[-101]), lwd=2, col = "magenta")

lever_stat = numeric(length = length(x))

for(i in 1:length(x)){

lever_stat[i] = 1/length(x) + (x[i]-mean(x))^2/sum((x-mean(x))^2)  # h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2

}

studentized_resid = residuals(fit)/sd(residuals(fit)) # rough standardization; rstudent(fit) gives properly studentized residuals

plot(studentized_resid, lever_stat, col = "red", lwd=2, xlab = "Studentized residuals", ylab = "Leverage Statistics")

index = which.max(lever_stat)

points(studentized_resid[index], lever_stat[index], col = "blue", lwd=2, pch ="*", cex= 3)

text(studentized_resid[index]+0.3, lever_stat[index], index)

The situation becomes difficult when we have more than one predictor. The following example makes this clear. The idea is that a leverage point may fall within the range of values of x_1 and of x_2 individually, but once we plot them jointly on the x_1-x_2 plane, the point lies outside the cloud of pairs of values taken by (x_1, x_2).

x1 = rnorm(100,0, 1)

x2 = 1*x1 + rnorm(100, 0, 0.5)

x1 = c(x1, 1)

x2 = c(x2, -2)

plot(x1, x2, col = "red")

Today's final assignments:

(Home work) Create a case study describing the impact of leverage points. Check whether the leverage points have a sizable impact on the conclusion. Compute the leverage statistic for each observation. Identify how the removal of the leverage points helps in improving the model. Read about Cook's distance and compute the statistic. Identify the utility of Cook's distance and list some functions in R that can be used to compute it.
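As a starting point for the Cook's distance part, a minimal sketch using base R (cooks.distance() and plot(fit, which = 4); fit_1 below is the outlier-containing fit from above):

cd = cooks.distance(fit_1) # Cook's distance for each observation
plot(cd, type = "h", ylab = "Cook's distance")
which(cd > 4/length(cd)) # a commonly used rough cutoff for flagging influential points
plot(fit_1, which = 4) # base R also plots Cook's distance directly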

(Home work) Read the section: Chapter No. 3, Page No. 92 of Introduction to Statistical Learning with Application in R

Machine Learning: Lecture - III (July 20, 2018) (01:30 PM - 03:30 PM)

In today's lecture we will cover the major concepts related to multiple linear regression. Until now we have discussed simple linear regression. Once we have more than one predictor, the situation falls under multiple linear regression. Let us start with a simulated scenario. Remember that the concepts we learnt in simple linear regression are going to be very useful. Let us create a case study where we have three input variables and a response variable. The specifications of the population regression plane (not line) are as follows:

n = 100 # length of the data set

x_1 = seq(1, 10, length.out = n)

x_2 = seq(0, 1, length.out = n)

x_3 = seq(-1,1, length.out = n)

b_0 = 1

b_1 = 0.1

b_2 = -1

b_3 = 10

sigma = 1

y = b_0 + b_1*x_1 + b_2*x_2 + b_3*x_3 + rnorm(n, 0, sigma)

dim(y)

class(y)

length(y)

X = cbind(rep(1,n), x_1, x_2, x_3)

Link with the design matrix: if the model contains an intercept, then the first column of the design matrix has all entries equal to 1. If we plan to fit a model without an intercept, then the first column is dropped. This can be easily performed in R by adding -1 in the formula of the lm function. In the following, we store the data in a data.frame, a very commonly used data storage facility in R. Python's pandas also has a DataFrame with functionality similar to R's data.frame.

head(X)

data = data.frame(y, x_1, x_2, x_3)

head(data)

dim(data)

names(data)

fit = lm(y ~ x_1 + x_2 + x_3, data = data)

summary(fit)

coef(fit)

As we can see, the coefficient estimates for x_2 and x_3 appear as NA. Let us try to investigate this. The following code checks whether there are any missing values in the data, which could be one cause. sum(complete.cases(data)) returns the total number of observations without missing values. The output tells us that there are no missing values.

complete.cases(data)

sum(complete.cases(data))

Let us compute the value of beta_hat from the theoretical formula. The following code is used to compute it. Note the syntax for matrix multiplication (%*%); confusing it with elementwise multiplication is a common mistake in a big program. The call below fails because the inverse of X'X does not exist. This is a typical situation where we use shrinkage estimation methods.

beta_hat = solve(t(X)%*%X)%*%t(X)%*%y

Two main shrinkage methods are ridge and lasso. These are often termed regularization methods for linear regression models; we will discuss them at the model selection stage in later lectures. So, how do we know that such a problem has occurred before computing the estimates from the formula itself? It might happen that the determinant is not exactly zero, but the value is very small, or close to zero. Such a matrix is called an ill-conditioned matrix. Recall the condition number of a matrix: if the condition number is large, then the matrix is called ill-conditioned, and a matrix with determinant zero has condition number infinity. Some graphical techniques are very useful to look for this. The scatter plot of all the variables may indicate linearity among the predictors. The following code does this job.

pairs(data)
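Besides the scatter plot, the problem can be detected numerically before attempting the inversion; a minimal sketch using base R functions (X is the design matrix constructed above):

qr(X)$rank # rank of the design matrix; smaller than ncol(X) signals exact linear dependence
ncol(X)
det(t(X) %*% X) # (near) zero determinant signals (near) singularity
kappa(t(X) %*% X) # condition number estimate; a very large value flags an ill-conditioned matrix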

Caution! It is not a good idea to drop predictors from the model after merely looking at the scatter plot. It might happen that several predictors, combined, have a significant impact on the response. Again, recall that analyzing data is an art; it cannot be done by studying some formulas or theorems alone. Let us go back to a situation where we do not have such problems. For simplicity, we simulate the predictors from some distributions randomly. We set the seed so that they do not change at later stages.

set.seed(123)

x_1 = runif(n, 1, 10)

x_2 = rnorm(n, 10, 3)

x_3 = runif(n, -1 , 1)

b_0 = 1

b_1 = 0.1

b_2 = -1

b_3 = 10

sigma = 1

y = b_0 + b_1*x_1 + b_2*x_2 + b_3*x_3 + rnorm(n, 0, sigma)

data = data.frame(y, x_1, x_2, x_3)

X = cbind(rep(1,n), x_1, x_2, x_3)

pairs(data)


fit = lm(y ~ x_1 + x_2 +x_3, data = data)

summary(fit)

coef(fit)

In R, regression without the intercept term can be conveniently performed using the following code. Note the inclusion of -1 in the formula of the lm function. If we want to do it using the closed-form matrix formula instead, then we have to drop the first column from the design matrix. Again, this can be easily performed in R. Examples are shown below for both cases.

fit = lm(y ~ - 1 + x_1 + x_2 +x_3, data = data)

summary(fit)

coef(fit)


X=X[,-1]

beta_hat = solve(t(X)%*%X)%*%t(X)%*%y

print(beta_hat)

Here again, I shall give you an assignment. Compute the confidence intervals for beta at home. You can compute the interval as beta_hat +/- 2*se(beta_hat), where se(beta_hat) is the standard error reported by summary(fit). In addition, for verification purposes, obtain the sampling distribution of beta_hat by repeating this process a large number of times. This has already been done once for the simple linear regression case, so this is just for improvement in coding practice. Let us come back to the regression with the intercept case. In the following, the diagnostic plots reinforce the ideas related to regression diagnostics which we developed for simple linear regression.


par(mfrow=c(2,2))

plot(fit)

It might happen that there are nonlinear relationships between the response and some predictors. Such a case can be handled by using a suitable transformation. Sometimes Generalized Additive Models (GAMs) are also appealing. We will talk about them along with the concepts related to statistical regularization.

set.seed(123)

x_1 = runif(n, 1, 10)

x_2 = rnorm(n, 10, 3)

x_3 = runif(n, -1 , 1)

b_0 = 1

b_1 = -1

b_2 = -1

b_3 = 10

sigma = 2

y = b_0 + b_1*x_1^2 + b_2*x_2 + b_3*x_3 + rnorm(n, 0, sigma)

data = data.frame(y, x_1, x_2, x_3)

X = cbind(rep(1,n), x_1, x_2, x_3)

pairs(data)


fit = lm(y~., data = data)

summary(fit)

Here I shall give one more assignment. Draw a three-dimensional scatter plot of x_2, x_3, and y and try to understand the problem more clearly. We do not have a clear relationship between y and x_2, or between y and x_3. However, the coefficient estimates appear to be statistically significant in the summary output; note that all the p-values are very small.
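For the three-dimensional scatter plot, one option is the scatterplot3d package (assuming it is installed); a minimal sketch:

library(scatterplot3d)
scatterplot3d(x_2, x_3, y, color = "red", pch = 16, xlab = "x_2", ylab = "x_3", zlab = "y")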

In general, we deal with two kinds of problems in a multiple linear regression setup. The first problem is prediction and the second problem is inference. Read this idea from ISLR. In simple language, for prediction we may consider the mathematical model as a black box: we are not interested in the contents of the black box, we only need good predictions from the model. The inference problem, on the other hand, is associated with the selection of the predictors. Suppose we are interested in predicting the likelihood that a bank customer will default. There may be more than 100 predictors available for computing this likelihood. However, out of so many predictors, only a few may be useful or important; the rest may just make the model unnecessarily complex. Essentially, in an inference problem, we are concerned with the question of which predictors are important.

Today's final assignments:

(Homework) Prepare a writeup explaining that the existence of multicollinearity in the data does not hamper the unbiasedness of the estimators, but the prediction intervals become very large; that is, predictions of future values contain a large amount of uncertainty. Discuss in how many ways one can investigate multicollinearity in the data. Create a case study by artificially incorporating multicollinearity in the predictors and check how it impacts the confidence intervals.

(Homework) Describe the concept of auto-correlation in the context of linear regression. Create a case study under the linear regression setup where the errors are correlated. Identify the impact of auto-correlation on the conclusions made from the linear regression model.

n=200 # number of rows in the data

x = cbind(rep(1, n), seq(1, 10, length.out = n)) # first column of ones for the intercept

rho = 0.9 # vary rho from -1 to 1 for experiment

cov = matrix(data = NA, nrow = n, ncol = n) # covariance matrix

cov = diag(rep(1, dim(x)[1]))

for(i in 1:(n-1)){

cov[i,i+1] = rho

}

for(i in 2:n){

cov[i,i-1] = rho

}

library(mvtnorm) # multivariate normal random variable generation

b0 = 1 # population intercept

b1 = 1 # population slope

beta = c(b0, b1)

y = x%*%beta + t(rmvnorm(1, mean = rep(0, n), sigma = cov, method = "svd"))

plot(x[,2],y)


beta_hat = solve(t(x)%*%x)%*%t(x)%*%y

resid = y - x%*%beta_hat # using the formula, not using lm function

acf(resid)

plot(resid[1:(n-1)], resid[2:n])


DW_test = sum((resid[2:n]-resid[1:(n-1)])^2)/sum(resid^2)

DW_test


The Durbin-Watson test statistic is approximately equal to 2*(1-r), where r is the lag-1 sample autocorrelation of the residuals. Thus, r = 0 (no serial correlation) gives a test statistic of about 2. The statistic always lies between 0 and 4. The closer the statistic is to 0, the more evidence for positive serial correlation; the closer to 4, the more evidence for negative serial correlation.
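A quick numerical check of this approximation using the residuals computed above:

r = cor(resid[1:(n-1)], resid[2:n]) # lag-1 sample autocorrelation of the residuals (approximate)
2*(1 - r) # should be close to DW_test computed above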

(Homework) Create a case study: Suppose that we have a data set with n rows and two variables y and x. y is the response and x is the predictor. Perform a linear regression and record the estimated values. Now create a data set of length 2n by copying the data twice. Again fit the regression line and record the estimated coefficients. Try to identify the pattern in the estimated values in these two cases. Now compare the sampling distribution of beta_hat of these two cases.

(Homework) Read the weighted least squares method based on the material provided in the class. Derive the equations to estimate the coefficients. Create a case study with simulated data in R or Python.


Machine Learning: Lecture - IV (July 23, 2018) (01:30 PM - 03:30 PM)

Suppose that we observe a quantitative response Y and p different predictors, X1,X2, · · · ,Xp. We assume that there is some relationship between Y and X = (X1,X2, · · · ,Xp), which can be written in the very general form

Y = f(X) + e

Here f is some fixed but unknown function of X1, X2, · · · , Xp, and e is a random error term which is independent of X and has mean zero. In this formulation, f represents the systematic information that X provides about Y. Statistical learning deals with several approaches to estimating the function f. The reasons behind estimating f are twofold, namely "prediction" and "inference". In many real-life problems the information about the input variables is readily available, whereas the output Y is difficult to obtain. In such cases, by estimating f we can predict Y using

Y_hat = f_hat(X)

where f_hat represents the estimate of f and Y_hat represents the resulting prediction of Y. We derived during the lecture that the expected squared difference between the predicted and actual value of Y can be broken into two components, E(Y-Y_hat)^2 = [f(X) - f_hat(X)]^2 + Var(e), where [f(X) - f_hat(X)]^2 represents the reducible component and Var(e) is the variance associated with the error term. The justification behind the term reducible error is that we can make this error smaller by choosing a different f_hat, so that the estimated function passes more closely through the observed Y values. However, we should not go too close to the data either. It is important to note that the irreducible error always provides an upper bound on the accuracy of the prediction for Y; however, this bound is almost always unknown in practice.
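A minimal simulation sketch of this decomposition (the true f, the fixed imperfect estimate f_hat, and the error standard deviation below are all assumptions for illustration):

set.seed(1)
n = 1e5
x = runif(n, 0, 1)
f = function(x) 1 + 2*x # true regression function (assumed)
f_hat = function(x) 1.2 + 1.8*x # a fixed, imperfect estimate of f (assumed)
sigma = 0.5
y = f(x) + rnorm(n, 0, sigma)
mean((y - f_hat(x))^2) # overall expected squared prediction error (approximated by simulation)
mean((f(x) - f_hat(x))^2) + sigma^2 # reducible part + irreducible part Var(e); nearly the same number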

An analyst may be too enthusiastic to get very close to the observed Y values by increasing the flexibility of the model. For example, suppose X is a single predictor and we are interested in predicting Y based on the observations (x_i, y_i), i = 1, 2, ..., n. Instead of choosing a linear regression model, we decide to make the prediction using a higher-order polynomial. As the degree of the polynomial increases, the shape of the fitted curve becomes more flexible. The degree of the polynomial therefore plays a crucial role, and deciding on the optimal degree should be supported by good prediction accuracy as well. In practice, we divide the data into two parts, one called the training set and the other the test set. The training set is used to learn about the data, that is, to estimate the function f_hat, and the predictive performance of f_hat is checked on the test data. We seek an optimal degree of the polynomial which is flexible but simultaneously possesses the best predictive accuracy on the test data.

In the supervised learning setup, the quality of fit is typically measured by the Mean Squared Error (MSE), given by MSE = sum(y_i - y_hat_i)^2 / n. So for the train and test sets, we have the notions of training MSE and test MSE. One can easily convince oneself that as the degree of the polynomial increases, the training MSE will decrease, since the polynomial follows the actual data values more and more closely, whereas the test MSE will eventually increase. We essentially look for a statistical learning method that gives the smallest expected test MSE. We discuss three cases where the behavior of the test MSE can be easily understood, considering three population descriptions of the function f: (a) linear, f(X) = b_0 + b_1*X; (b) moderately nonlinear, f(X) = b_0 + b_1*sqrt(X); and (c) highly nonlinear, f(X) = b_0 - b_1*sin(X/60) + b_2*sqrt(X). We shall see that the expected test MSE first decreases as we increase the flexibility and then increases again beyond some optimal degree of the polynomial. This U-shaped behavior of the expected test MSE is universal, whereas the training MSE always decreases as the order of the polynomial increases. The following graphs are very general in nature.

The following R codes were used to generate these figures. The corresponding Python programs to generate similar figures were written by Fatima Razi Ahmed Iffat (our M.Sc. student). The Python program has been put on the ML Python page.

The following code is used to understand the behavior of the expected training and test MSE when the true relationship between the response and the predictor is moderately nonlinear.

par(mfrow=c(1,2))

x = seq(1, 100, length.out = 100)

b_0 = 1

b_1 = 1

sigma = 0.5

y = b_0 + b_1 *sqrt(x) + rnorm(length(x), 0, sigma)

plot(x,y, col = "red")

curve(b_0 + b_1*sqrt(x), lwd=2, col = "black", add = TRUE)

data = data.frame(x,y)

deg = 1:10

library(ISLR)

library(polynom)

rep = 1000

train_mse = matrix(NA, nrow = rep, ncol = length(deg))

test_mse = matrix(NA, nrow = rep, ncol = length(deg))

for(i in 1:rep){

train = sample(1:length(x), size = floor(0.6*length(x)) )

train_data = data[train,]

test_data = data[-train,]

for(d in deg){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

pred_y = fitted.values(fit)

if(i==1){

p = polynomial(coef = coef(fit))

fun = as.function(p)

curve(fun(x), lwd=2, col = d+1, add = TRUE)

}

train_mse[i,d] = mean((pred_y-train_data$y)^2)

pred_y_test = predict(fit, newdata = test_data)

test_mse[i,d] = mean((pred_y_test - test_data$y)^2)

}

}

plot(deg, colMeans(train_mse), type = "l", col = "red", lwd=2, ylim = c(0.1,0.5), xlab = "Degree of Flexibility", ylab = "MSE")

lines(deg, colMeans(test_mse), type = "l" , col = "blue", lwd=2)

legend("topright", c("train MSE", "test MSE"), col = c("red", "blue"), lwd=c(2,2), bty= "n")

abline(h = sigma^2, lwd=2, col = "grey", lty=4)


The following code is used to understand the behavior of the expected training and test MSE when the true relationship between the response and the predictor is linear.

x = seq(1, 100, length.out = 100)

b_0 = 1

b_1 = 0.5

sigma = 4

y = b_0 + b_1 *x + rnorm(length(x), 0, sigma)

plot(x,y, col = "red")

curve(b_0 + b_1*x, lwd=2, col = "black", add = TRUE)

data = data.frame(x,y)

deg = 1:10

library(ISLR)

library(polynom)

rep = 500

train_mse = matrix(NA, nrow = rep, ncol = length(deg))

test_mse = matrix(NA, nrow = rep, ncol = length(deg))

for(i in 1:rep){

train = sample(1:length(x), size = floor(0.7*length(x)) )

train_data = data[train,]

test_data = data[-train,]

for(d in deg){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

pred_y = fitted.values(fit)

if(i==1){

p = polynomial(coef = coef(fit))

fun = as.function(p)

curve(fun(x), lwd=2, col = d+1, add = TRUE)

}

train_mse[i,d] = mean((pred_y-train_data$y)^2)

pred_y_test = predict(fit, newdata = test_data)

test_mse[i,d] = mean((pred_y_test - test_data$y)^2)

}

}

plot(deg, colMeans(train_mse), type = "l", col = "red", lwd=2, ylim = c(10, 30), xlab = "Degree of Flexibility", ylab = "MSE")

lines(deg, colMeans(test_mse), type = "l" , col = "blue", lwd=2)

legend("topright", c("train MSE", "test MSE"), col = c("red", "blue"), lwd=c(2,2), bty= "n")

abline(h = sigma^2, lwd=2, col = "grey", lty=4)

The following code is used to understand the behavior of the expected training and test MSE when the true relationship between the response and the predictor is highly nonlinear.

x = seq(1, 100, length.out = 100)

b_0 = 1

b_1 = 1

b_2 = 0.1

sigma = 0.01

y = b_0 - b_1 *sin(x/60) + b_2*sqrt(x) + rnorm(length(x), 0, sigma)

plot(x,y, col = "red")

curve(b_0 - b_1 *sin(x/60) + b_2*sqrt(x), lwd=2, col = "black", add = TRUE)

data = data.frame(x,y)

deg = 1:10

library(ISLR)

library(polynom)

rep = 100

train_mse = matrix(NA, nrow = rep, ncol = length(deg))

test_mse = matrix(NA, nrow = rep, ncol = length(deg))

for(i in 1:rep){

train = sample(1:length(x), size = floor(0.8*length(x)) )

train_data = data[train,]

test_data = data[-train,]

for(d in deg){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

pred_y = fitted.values(fit)

if(i==1){

p = polynomial(coef = coef(fit))

fun = as.function(p)

curve(fun(x), lwd=2, col = d+1, add = TRUE)

}

train_mse[i,d] = mean((pred_y-train_data$y)^2)

pred_y_test = predict(fit, newdata = test_data)

test_mse[i,d] = mean((pred_y_test - test_data$y)^2)

}

}

plot(deg, colMeans(train_mse), type = "l", col = "red", lwd=2, xlab = "Degree of Flexibility", ylab = "MSE")

lines(deg, colMeans(test_mse), type = "l" , col = "blue", lwd=2)

legend("topright", c("train MSE", "test MSE"),

col = c("red", "blue"), lwd=c(2,2), bty= "n")

abline(h = sigma^2, lwd=2, col = "grey", lty=4)

Note the use of the polynom package in R to construct the polynomial from the estimated coefficients. We cannot directly plot the polynomial using the curve() function, because curve() accepts a function object as its argument. The polynomial() function returns an object of class polynomial, and as.function() converts the polynomial object into a function object that can be passed to curve() for plotting. It is super easy and exciting too. The package deals with a lot of things related to orthogonal functions; we will explore it at a later time. The functions whose names start with as. play an important role in the conversion of data types in R. Some common ones are as.data.frame, as.matrix, as.vector, as.function, as.list, as.factor and many others. This makes conversion among objects very user-friendly.

Another note is that the predict function expects newdata to be a data.frame. If the data are stored in a matrix, they have to be converted using the as.data.frame() function.


Machine Learning: Lecture - V (July 27, 2018) (01:30 PM - 03:30 PM)

Many students are having difficulties in understanding the computation part, in particular the difference between the training MSE and the expected training MSE. If we perform a single train-test validation, then we essentially obtain a sample test MSE and a sample training MSE; repeating this process a large number of times, we may approximate the expected training and expected test MSE. So, in today's lecture, we will code the process explicitly and explain each of the steps. I shall try to unfold the previous day's lecture by explaining each and every part of the program explicitly. You are requested to follow the process in Python and obtain the universal pattern of training and test MSE. Along with that, we shall also discuss the concepts of training bias and test bias. I shall explain the process assuming that the population relationship is linear in nature. We shall estimate the MSE and bias for both the training and test sets. We shall fit polynomials of various degrees (the degree is the measure of flexibility) and obtain their estimates based on a large number of training and test set partitions. The program is done step by step. You are requested to follow the code step by step and write the algorithm on your own on a piece of paper.

set.seed(123)

# input variable specification

x = seq(1, 10, length.out = 100)

# fixation of the population parameter

b_0 = 1 # intercept

b_1 = 2 # slope

sigma = 1 # population error standard deviation

y = b_0 + b_1 * x + rnorm(length(x), 0, sigma)


# Lets store them in a data frame

data = data.frame(x,y)

plot(x,y, col = "red")


# Now we do not know the population function f(x). We plan to estimate it using some f_hat

fit_1 = lm(y ~ x, data = data)


training_mse = mean((fit_1$residuals)^2)

print(training_mse)

abline(fit_1, col = "blue")


y_hat = coef(fit_1)[1] + coef(fit_1)[2]*x

resids = y-y_hat

training_mse = mean(resids^2)

print(training_mse)


We used the complete data for training the model, so we do not have any data left to test the predictive performance of the model. So let us separate the data into two parts.

# training set and test set

View(data)

train_prop = 0.8

train_index = sample(1:length(x), size = train_prop*length(x), replace = FALSE)

print(train_index)

train_data = data[train_index, ]

dim(train_data)

test_data = data[-train_index,]

dim(test_data)

fit_1 = lm(y ~ x, data = train_data)

# computation of training mse

y_hat = coef(fit_1)[1] + coef(fit_1)[2]*train_data$x # fitted values on the training set

resids = train_data$y - y_hat

training_mse = mean(resids^2)

print(training_mse)

# computation of test mse

y_pred = coef(fit_1)[1] + coef(fit_1)[2]*test_data$x

test_y = test_data$y

test_resid = y_pred - test_y

test_mse = mean(test_resid^2)

print(test_mse)


plot(x,y , col = "red")

points(test_data$x, test_data$y, col = "blue", lwd=2)

points(test_data$x, y_pred, col = "magenta", pch = "*", cex=2)

legend("topleft", c("train data", "test data", "y pred"), col= c("red", "blue", "magenta"), lwd=c(1,2,1))

# lets check for higher order polynomial

# I() is a wrapper to create the variable x^2 and treat it as a separate input

fit_2 = lm(y ~ x + I(x^2), data = train_data)

resids = residuals(fit_2)

training_mse = mean(resids^2)

print(training_mse)


plot(x,y, col = "red")

curve(coef(fit_2)[1] + coef(fit_2)[2]*x + coef(fit_2)[3]*x^2,

col = "blue", add =TRUE)


# computation of test mse

y_pred = coef(fit_2)[1] + coef(fit_2)[2]*test_data$x + coef(fit_2)[3]*test_data$x^2

test_y = test_data$y

test_resid = y_pred - test_y

test_mse = mean(test_resid^2)

print(test_mse)


# Computation of third order polynomials

fit_3 = lm(y ~ x + I(x^2) + I(x^3), data = train_data)

resids = residuals(fit_3)

training_mse = mean(resids^2)

print(training_mse)


plot(x,y, col = "red")

curve(coef(fit_3)[1] + coef(fit_3)[2]*x + coef(fit_3)[3]*x^2 + coef(fit_3)[4]*x^3,

col = "blue", add =TRUE)


# computation of test mse

y_pred = coef(fit_3)[1] + coef(fit_3)[2]*test_data$x + coef(fit_3)[3]*test_data$x^2 + coef(fit_3)[4]*test_data$x^3

test_y = test_data$y

test_resid = y_pred - test_y

test_mse = mean(test_resid^2)

print(test_mse)


library(ISLR)

fit_5 = lm(y ~ poly(x, 5, raw = TRUE), data = train_data)

summary(fit_5)

coef(fit_5)

plot(x,y, col = "red" )

library(polynom)

estimate_f = polynomial(coef = coef(fit_5))

print(estimate_f)

curve(estimate_f(x), -1, 1) # It does not work

class(estimate_f) # Note that it is not a function object, rather it is a polynomial object

estimate_f = as.function(estimate_f)

class(estimate_f)

curve(estimate_f(x), 0, 10, col = "blue", add =TRUE, lwd=2)


Similarly, we can go to higher orders of the polynomial, each time computing the training and test MSE and plotting them against the degree of the polynomial. We should choose the degree of the polynomial which has the smallest test MSE.

Merely doing the above procedure will not guarantee the selection of the optimal degree. Both the train and test MSEs are estimates based on a single train-test validation. To obtain an accurate estimate of the expected train and test MSE, we have to carry out the above process several times, compute the average of all the train and test MSEs, and plot them against the degree of the polynomial. Then choose the optimal level of flexibility based on the graph.

# input variable specification

x = seq(1, 10, length.out = 100)

# fixation of the population parameter

b_0 = 1 # intercept

b_1 = 2 # slope

sigma = 5 # population error standard deviation

y = b_0 + b_1 * x + rnorm(length(x), 0, sigma)


# Lets store them in a data frame

data = data.frame(x,y)

plot(x,y, col = "red")


rep = 100

degree = 1:10

train_prop = 0.7


training_mse = matrix(data = NA, nrow = rep, ncol = length(degree))

test_mse = matrix(data = NA, nrow = rep, ncol = length(degree))


training_bias = matrix(data = NA, nrow = rep, ncol = length(degree))

test_bias = matrix(data = NA, nrow = rep, ncol = length(degree))



plot(x,y, col = "red")

library(polynom)

library(ISLR)

for(i in 1:rep){

train_index = sample(1:length(x), size = train_prop*length(x), replace = FALSE)

train_data = data[train_index, ]

test_data = data[-train_index,]

for(d in degree){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

fit_poly = polynomial(coef = coef(fit))

fun_poly = as.function(fit_poly)

if(i == 1){

curve(fun_poly(x), add = TRUE, col = d)

}

# computation of training mse

y_hat = predict(fit, newdata = train_data)

training_mse[i, d] = mean((y_hat - train_data$y)^2)

# computation of test mse

y_pred = predict(fit, newdata = test_data)

test_mse[i, d] = mean((y_pred - test_data$y)^2)

# Computing training bias

training_bias[i,d] = mean(y_hat - b_0 - b_1*train_data$x)^2

# Computing test bias

test_bias[i,d] = mean(y_pred - b_0 - b_1*test_data$x)^2

}

}


avg_training_mse = colMeans(training_mse)

avg_test_mse = colMeans(test_mse)

avg_training_bias = colMeans(training_bias)

avg_test_bias = colMeans(test_bias)


y_lim = c(min(avg_training_mse, avg_test_mse), max(avg_training_mse, avg_test_mse))

plot(degree, avg_training_mse, col = "red", lwd=2,

type = "l", ylim = y_lim)

lines(degree, avg_test_mse, col = "blue", lwd=2)

legend("topleft", c("train MSE", "test MSE"), col = c("red","blue"), lwd=c(2,2))


y_lim = c(min(avg_training_bias, avg_test_bias), max(avg_training_bias, avg_test_bias))

plot(degree, avg_training_bias, col = "red", type = "l", ylim = y_lim, lwd=2)

lines(degree, avg_test_bias, col = "blue", type = "l", lwd=2)

legend("topleft", c("train bias", "test bias"), col = c("red","blue"), lwd=c(2,2))


(Homework) Replicate the above work in Python. Consider different levels of non-linearity in the population relationship. Linear, moderately nonlinear and highly nonlinear. Investigate the patterns of expected training mse, expected test mse and expected training bias and expected test bias.

Machine Learning: Lecture - VI (July 30, 2018) (01:30 PM - 03:30 PM)

Today we have mainly discussed three topics: K-nearest neighbour regression, the variance inflation factor, and an overview of resampling methods. First we discuss the K-nearest neighbour method. In the following code we have used the function knnreg from the caret package to perform KNN regression. This is sample code to show how increasing the value of K changes the smoothness of the predicted curve. Here I did not include how the optimal choice of K is to be decided; I have put that as an exercise because it requires the same idea of train and test validation (repeated several times). This code will help you to do the homework at the end of this session.

Local methods in Regression: K-nearest neighbour regression

x = seq(1, 10, length.out = 50)

b_0 = 1

b_1 = 0.3

sigma = 0.5

y = b_0 + b_1*x + rnorm(length(x), 0, sigma)

plot(x,y, col = "red")

data = data.frame(x,y)


library(caret)

fit = knnreg(as.data.frame(x),y, k=1)

summary(fit)

par(mfrow=c(2,3))

plot(x,y, col = "red", main = "k=1")

fit_1 = knnreg(as.data.frame(data$x), data$y, k=1)

y_hat = predict(fit_1, as.data.frame(data$x))

lines(x, y_hat, col = "grey", lty=1)

Note the edge effect. If k = 2, then the nearest two values, including the current one, are averaged. But when the observation is at the edge, the neighbours can come from only one side, so the fit behaves differently there.

plot(x,y, col = "red", main = "k=2")

fit_2 = knnreg(as.data.frame(data$x), data$y, k=2)

y_hat = predict(fit_2, as.data.frame(data$x))

lines(x, y_hat, col = "magenta", lty=2)


plot(x,y, col = "red", main = "k=3")

fit_3 = knnreg(as.data.frame(data$x), data$y, k=3)

y_hat = predict(fit_3, as.data.frame(data$x))

lines(x, y_hat, col = "green", lty=3)


plot(x,y, col = "red", main = "k=4")

fit_4 = knnreg(as.data.frame(data$x), data$y, k=4)

y_hat = predict(fit_4, as.data.frame(data$x))

lines(x, y_hat, col = "blue", lty=4)


plot(x,y, col = "red", main = "k=5")

fit_5 = knnreg(as.data.frame(data$x), data$y, k=5)

y_hat = predict(fit_5, as.data.frame(data$x))

lines(x, y_hat, col = "red", lty=5)


plot(x,y, col = "red", main = "k=10")

fit_10 = knnreg(as.data.frame(data$x), data$y, k=10)

y_hat = predict(fit_10, as.data.frame(data$x))

lines(x, y_hat, col = "red", lty=10)

In the following code we consider the population relationship between the response and the predictor to be nonlinear. Let us check the performance of the K-nearest neighbour method.

(Homework) As we observed, if we increase the value of K the predicted curve becomes smoother. With a small value of K, the estimated curve stays closer to the observed y values. This might indicate overfitting, but essentially we will have a lower bias. As we increase the value of K, the (training) bias will increase. So the question is how we can choose an optimal value of K. The homework is to write a program to select an optimal value of K: perform train and test validation multiple times and obtain the estimates of the training and test MSE. Plot them against 1/K. Select the optimal level of flexibility based on the lowest test MSE, and mark the optimal value of K using a big dot.

The following code is essentially the solution of the problem. But I suggest you write the code on your own and do the experiment in Python. Note the use of the function sort(): sorting the training indices keeps the x values in increasing order, so that lines() draws the fitted curves properly in a single plot. You may experiment without the sort() and check how the plot becomes wiggly.

# Experiment on the flexibility with linear true relationship. Do the same process using a nonlinear function.

x = seq(1, 10, length.out = 50)

b_0 = 1

b_1 = 0.3

sigma = 0.5

y = b_0 + b_1*x + rnorm(length(x), 0, sigma)

data = data.frame(x, y) # refresh the data frame with the newly simulated response

par(mfrow=c(1,2))

plot(x,y, col = "red")


train_prop = 0.8


k_vals = 1:10

rep = 50

training_mse_knn = matrix(data = NA, nrow = rep, ncol = length(k_vals))

test_mse_knn = matrix(data = NA, nrow = rep, ncol = length(k_vals))


for(i in 1:rep){

train_index = sample(1:length(x), size = train_prop*length(x), replace = FALSE)

train_data = data[sort(train_index), ]

test_data = data[-sort(train_index),]

for(k in k_vals){

fit_knn = knnreg(as.data.frame(train_data$x), train_data$y, k=k)


# Computation of training mse

y_hat = predict(fit_knn, as.data.frame(train_data$x))

training_mse_knn[i, k] = mean((y_hat - train_data$y)^2)

# The following code is just used for plotting purpose, similar to what we have done for different orders of polynomials

if(i==1){

lines(train_data$x, y_hat, col = k, lty=k)

}

# Computation of test mse

y_pred = predict(fit_knn, as.data.frame(test_data$x))

test_mse_knn[i, k] = mean((y_pred - test_data$y)^2)

}

}

avg_training_mse_knn = colMeans(training_mse_knn)

avg_test_mse_knn = colMeans(test_mse_knn)

y_lim = c(min(avg_training_mse_knn, avg_test_mse_knn),max(avg_training_mse_knn, avg_test_mse_knn))

plot(1/k_vals, avg_training_mse_knn, col = "red", ylim = y_lim, type = "b", ylab = "MSE", lwd=2)

lines(1/k_vals, avg_test_mse_knn, col = "blue", type = "b", lwd=2)

legend("bottomright", c("training mse", "test mse"), col = c("red", "blue"),lwd=c(2,2), bty = "n")

(Homework) In the above example, we have assumed the population to have a linear relationship. Create a case study in which you consider the population function f(X) to be nonlinear; you may take Y = b_0 + b_1*sqrt(X) + e. Then perform a train-test validation to choose the optimal value of K. Compare the pattern of the test MSE with the situation where f(X) is linear.

(Homework) Read Section 2.5 from ESL (Local Methods in High Dimensions). The concepts are already discussed in the class. Read Section 3.5 from ISLR. Get a visual perception about the curse of dimensionality.

Caution! KNN is a non-parametric method, as we do not assume any mathematical function whose parameters need to be estimated from the training data. KNN performs slightly worse than linear regression when the relationship is linear, but much better than linear regression in non-linear situations. Do the above homework and the conclusions will be clear to you. Hence, "in a real-life situation in which the true relationship is unknown, one might draw the conclusion that KNN should be favored over linear regression because it will at worst be slightly inferior to linear regression if the true relationship is linear, and may give substantially better results if the true relationship is non-linear. But that is not always the case, because in higher dimensions, KNN often performs worse than linear regression" (James et al., ISLR).

In the next session, we discussed the multicollinearity problem. In previous lectures, we raised this issue and explained that if two or more variables are linearly related, then (X'X) becomes singular or nearly singular, and hence the uncertainty in the parameter estimates increases. When two predictors are linearly related, this can be identified from pairs() plots or by looking at the correlation matrix of the predictors: two predictors with high correlation may give an indication of multicollinearity. However, the problem becomes severe when a particular predictor can be written as a linear combination of the others; this is difficult or impossible to identify from pairwise scatter plots. In such situations, we look at the variance inflation factor (VIF). It is computed as follows. Suppose we have p predictors, X_1, X_2, ..., X_p. We regress X_1 on X_2, ..., X_p by a linear model and compute the coefficient of determination, call it R^2_{1|-1}. We do the same for X_2 with X_1, X_3, ..., X_p as predictors, and so on, recording the coefficient of determination each time; so there will be p R-squared values. A high value of R^2_{j|-j} (j = 1, 2, ..., p) indicates a strong linear relationship among the predictors. The variance inflation factor is VIF(beta_hat_j) = 1/(1 - R^2_{j|-j}) for j = 1, 2, ..., p. A high VIF value indicates which variable is responsible for inflating the variance, i.e., the source of multicollinearity.
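A minimal sketch of this manual VIF computation; the three simulated predictors below are assumptions for illustration, with x3 constructed to be nearly a linear combination of x1 and x2:

set.seed(1)
n = 100
x1 = rnorm(n)
x2 = rnorm(n)
x3 = 0.8*x1 + 0.6*x2 + rnorm(n, 0, 0.1) # nearly a linear combination of x1 and x2
R2_1 = summary(lm(x1 ~ x2 + x3))$r.squared
R2_2 = summary(lm(x2 ~ x1 + x3))$r.squared
R2_3 = summary(lm(x3 ~ x1 + x2))$r.squared
VIF = 1/(1 - c(R2_1, R2_2, R2_3))
print(VIF) # large values flag the predictors involved in the near-linear relationship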

(Homework) Create a case study to demonstrate the impact of multicollinearity. Identify the source of multicollinearity by computing the VIF on your own. An R function for the VIF is also available (e.g., vif in the car package); you can cross-check with it.

The third topic we discussed was cross validation: leave-one-out cross validation and K-fold cross validation. Read Section 5.1 (Cross-Validation) of ISLR, Subsections 5.1.1 - 5.1.4. Lecture material with codes is available on the Software Lab page. A brief outline is given here:

[Leave One Out Cross Validation (LOOCV)]

Suppose that we have the dataset (x_i, y_i), i = 1, · · · , n. In this method, the set of observations is divided into two parts; the test set contains only one observation (say (x_1, y_1)) and the training set contains the data (x_i, y_i), i = 2, · · · , n. The model is fitted on the training set and the test MSE is computed by predicting the response y_1 in the test set; call it MSE_1. The process is repeated for (x_2, y_2), ..., (x_n, y_n), and MSE_2, ..., MSE_n are computed respectively. The average test MSE is computed as

MSE =(MSE_1+MSE_2+ · · · +MSE_n)/n.

[K-Fold Cross Validation (K-Fold CV)]

The sample is divided into k non-overlapping sets of (roughly) equal size. For each of the k sets, we use the other k-1 folds for training and the held-out fold for testing. The mean squared error, MSE_1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as the validation set. This process results in k estimates of the test error, MSE_1, MSE_2, . . . , MSE_k. The k-fold CV estimate of the test MSE is computed by averaging these values:

MSE =(MSE_1+MSE_2+ · · · +MSE_k)/k

  • LOOCV is a special case of the k-fold cross validation, where k is chosen to be equal to the number of observations.

  • LOOCV is time consuming due to the number of model fits required when the dataset is large.

  • With k-fold cross-validation, the computation time is reduced because fewer model fits are performed, while all the observations are still eventually used both to train and to test the model. (A minimal code sketch using the boot package is given right after these bullet points.)
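A minimal sketch of LOOCV and 10-fold CV using cv.glm() from the boot package (the simulated data below are an assumption for illustration):

library(boot)
set.seed(1)
x = seq(1, 10, length.out = 100)
y = 1 + 2*x + rnorm(100, 0, 1)
data = data.frame(x, y)
fit_glm = glm(y ~ x, data = data) # glm with the default gaussian family is ordinary least squares
cv.glm(data, fit_glm)$delta[1] # LOOCV estimate of the test MSE
cv.glm(data, fit_glm, K = 10)$delta[1] # 10-fold CV estimate of the test MSE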

In the following code we check the behaviour of the test MSE for a simple train-test validation, repeated a few times. As we see, the estimates of the test MSE vary enormously across different test sets. The reason is that the prediction heavily depends on which observations are included in the test set and which in the training set. In addition, there is no guarantee that each observation is ever used in a training set or in a test set. Hence, for some training sets the model might perform very well while its performance on the test set is not satisfactory, and vice versa. In such a situation, we need a resampling method that guarantees that each observation is included in the training set as well as in the test set.

Leave One Out Cross Validation is one such method. In LOOCV we use the data point (x_1, y_1) as the test set and the remaining (n-1) observations to train the statistical learning algorithm. A similar process is carried out for (x_2, y_2), (x_3, y_3), and so on. So essentially we end up with n training and test MSEs, which are averaged to get the LOOCV estimates of the training and test MSE. The model with the smallest test MSE may be considered for prediction. However, note that we need to build the model n times, which is computationally expensive.

So we need a resampling technique that is computationally less expensive and simultaneously ensures that all the observations are used for both training and testing. This can be achieved using K-fold cross validation. Essentially, we separate the complete set of observations into K distinct subsets (say B_1, B_2, ..., B_K). We train the model using the subsets B_2, B_3, ..., B_K and test its performance on B_1; a similar process is carried out with each subset held out in turn. We end up with K training and test MSE values, which are averaged to obtain the estimates of the expected training and expected test MSE. The model with the smallest test MSE can be considered for prediction. Note that here the number of times the training is carried out is small: if n = 100 and K = 10, LOOCV requires the algorithm to be run 100 times, whereas K-fold cross validation requires only 10 runs. It is a very powerful method and commonly appears in real data analyses published in top-tier journals.

In the following code, change the value of prop_train and check the behaviour of the training and test MSE. You may also experiment with different values of the population standard deviation sigma. It is also suggested to explore it by varying the degree of non-linearity in the population relationship of x and y.

par(mfrow=c(1,1))

x = seq(1, 10, length.out = 100)

b_0 = 1

b_1 = 2 # population slope (used in the simulation of y below)

sigma = 0.8

y = b_0 + b_1*x + rnorm(length(x), 0, sigma)

plot(x,y,col = "red")

data = data.frame(x,y)

prop_train = 0.8

rep = 10

degree = 1:10 # degrees of the polynomial to be compared

training_mse = matrix(data = NA, nrow = rep, ncol = length(degree))

test_mse = matrix(data = NA, nrow = rep, ncol = length(degree))


for(i in 1:rep){

train_index = sample(1:length(x), size = prop_train*length(x), replace = FALSE)

train_data = data[train_index, ]

test_data = data[-train_index,]


for(d in degree){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

fit_poly = polynomial(coef = coef(fit))

fun_poly = as.function(fit_poly)

if(i == 1){

curve(fun_poly(x), add = TRUE, col = d)

}

# computation of training mse

y_hat = predict(fit, newdata = train_data)

training_mse[i, d] = mean((y_hat - train_data$y)^2)

# computation of test mse

y_pred = predict(fit, newdata = test_data)

test_mse[i, d] = mean((y_pred - test_data$y)^2)

}

}

matplot(test_mse, type = "l", lwd=2, xlab = "Flexibility")

matplot(training_mse, type = "l", lwd=2, xlab = "Flexibility")

Machine Learning: Lecture - VII (August 6, 2018) (01:30 PM - 03:30 PM)

In the previous lecture, we discussed and compared some resampling methods. These methods are basically used to estimate the expected training error and expected test error rates. From today's lecture, we will concentrate more on the statistical properties of machine learning methods. Without any dilemma, we shall refer to the different techniques as machine learning methods rather than statistical methods, although most of the methods originated much earlier, before the term machine learning became popular.

Recall the concept of parametric methods, in which we assume some mathematical function relating the predictors and the response. The problem of understanding the data then boils down to estimating the parameters of the assumed function (model) efficiently. Several estimation procedures are available for estimating the parameters. Today, we start our discussion with two popular methods: (a) maximum likelihood estimation and (b) the method of least squares. Due to the limitations of mathematical notation on the Google web page, I am giving here the book names and pages which were covered during the lecture. We started with a review of the method of maximum likelihood estimation by way of an example, where we estimated the rate parameter of the Poisson(lambda) distribution based on a sample of size n, and recalled the likelihood equation.

(Homework) Write the log-likelihood function for the Poisson(lambda) distribution based on a sample of size n. Use the function optim() from R to minimize the negative log-likelihood function. Check your result with the analytical calculation.
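A minimal sketch of the numerical part (the simulated Poisson data are an assumption; the analytical MLE is the sample mean):

set.seed(1)
x = rpois(50, lambda = 3) # simulated sample (assumed)
negloglik = function(lambda) -sum(dpois(x, lambda, log = TRUE)) # negative log-likelihood
opt = optim(par = 1, fn = negloglik, method = "Brent", lower = 1e-6, upper = 100)
opt$par # numerical MLE of lambda
mean(x) # analytical MLE, for comparison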

We have studied the following things during the class: Book: All of Statistics by Larry Wasserman

  • Definition of the Simple Linear Regression Model (Section 13.1)

  • Derivation of Maximum Likelihood and Least Squares Estimators for Simple Linear Regression.

  • Theorem: Under the assumption of normality, the least-squares estimator is also the maximum likelihood estimator.

  • Properties of the Least Squares Estimators. (Theorem 13.8) (Homework)

  • We have also discussed the consistency and asymptotic normality of the LSE. We also derived the (1-alpha) confidence interval and the Wald test for the null hypothesis H_0: beta_1 = 0 versus H_1: beta_1 is not equal to 0.

  • Then we moved back to multiple linear regression. Since the derivations had already been done, we did not derive the results again, although we did discuss the unbiased estimator of sigma^2.

  • Then we discussed model selection. We defined the prediction risk and saw that the training error is a downward-biased estimate of the prediction risk. The reason for this bias is that the data are used twice: to estimate the parameters and to estimate the risk. The bias is -2*sum_i Cov(Y_hat_i, Y_i). Thus, when we fit a very complex model with many parameters, the covariance terms will be large and the bias of the training error gets worse. Thus we need some other measures to estimate the risk. In the next lecture, we shall discuss other measures of prediction risk.

(Homework) Consider the regression-through-the-origin model: Y_i = beta*X_i + e_i. Find the least-squares estimate of beta. Find the standard error of the estimate. Find conditions that guarantee that the estimate is consistent.

(Homework) Exercises 13.6, Problem numbers 6 and 7 from All of Statistics by Larry Wasserman. These problems are related to real data.


Machine Learning: Lecture - VIII (August 9, 2018) (10:30 AM - 12:30 PM)

In today's lecture, we shall explore different measures of prediction risk. In the previous lecture we saw that the training error is a downward-biased estimate of the prediction risk R(S). Here S is a subset of {1, 2, ..., p}, where p is the number of predictors, Chi_S = {X_j : j belongs to S} is the corresponding set of covariates, beta_S denotes the coefficients of this set of predictors, and X_S denotes the design matrix associated with this subset. The prediction risk is R(S) = sum_{i=1}^n E(Y_hat_i(S) - Y_i^*)^2, where Y_i^* denotes the value of a future observation of Y_i at the covariate value X_i.

Now, to choose a model with an optimal level of complexity, we need to estimate the prediction risk more accurately. Since the training error has a downward bias, it cannot be used directly, as it will always tend to select the model with the higher complexity level. Note that in the present discussion complexity is understood in the following sense: a linear regression model with a large number of predictors is more complex than a model with fewer predictors. As the model complexity increases, the training error R_hat_tr(S) decreases monotonically; if we keep adding covariates, the training error will keep decreasing. R_hat_tr(S) essentially measures lack of fit and is insensitive to the increasing complexity (hence models chosen by it tend to perform worse on the test set). So we need a measure that estimates R(S) (the prediction risk) by incorporating both the lack of fit and the complexity of the model. Mallow's Cp is such an improved estimator of R(S), defined as R_hat(S) = R_hat_tr(S) + 2|S|*sigma_hat^2, where |S| is the number of predictors in the set S and sigma_hat^2 is an estimate of sigma^2 obtained from the full model, that is, the model with all the predictors included. Thus Mallow's Cp = lack of fit + bias correction.
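A minimal sketch of computing Mallow's Cp across nested polynomial models (the simulated data and the choice of the degree-10 fit as the "full" model are assumptions for illustration):

set.seed(1)
n = 100
x = seq(1, 5, length.out = n)
y = 0.1 + 2*sqrt(x) - sin(x) + rnorm(n, 0, 0.3)
sigma_hat_sq = summary(lm(y ~ poly(x, 10, raw = TRUE)))$sigma^2 # sigma^2 estimated from the full model
Cp = sapply(1:10, function(d){
  fit = lm(y ~ poly(x, degree = d, raw = TRUE))
  sum(residuals(fit)^2) + 2*(d + 1)*sigma_hat_sq # training error + 2|S|*sigma_hat^2, with |S| = d + 1
})
which.min(Cp) # degree with the smallest estimated prediction risk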

The above discussion essentially reflects the fact that finding a good model involves trading off fit against model complexity. Another important measure is the AIC (Akaike Information Criterion), defined as AIC(S) = log-likelihood(S) - |S|, which is again of the form: goodness of fit - complexity.

(Homework) Consider a model y = f(x) + e, with f(x) = 0.1 + 2*sqrt(x) -sin(x), and sigma = 0.3 (the population relation). Simulate a data set of size n=100. Fit polynomials of varying degree (1:10) and each case compute the AIC. Check whether AIC is selecting the same model as selected by the K-fold cross validation. In the class, we have already derived the expression for the log-likelihood and elaborated that how the AIC needs to be computed.

(Homework) Assume a linear regression model with normal errors. Take sigma to be known. Show that the model with the highest AIC is the model with the lowest Mallow's Cp statistic.

We have also discussed the LOOCV and K-fold CV estimators of the prediction risk. Refer to the previous lecture for an elaborate explanation of these concepts.

(Homework) In the previous lecture, we saw that if a data set contains n rows, then computing the LOOCV error requires the model to be built n times, which may not be computationally feasible. However, this laborious job can be shortened by using the fact that R_hat_LOOCV(S) = sum_{i=1}^n (Y_i - Y_hat_i(S))^2/(1 - U_{ii}(S))^2, where U_{ii}(S) is the i-th diagonal element of the matrix U(S) = X_S (X_S'X_S)^{-1} X_S'. Try to identify this matrix from the linear model lectures. This formula essentially states that we do not need to build the model n times. Prove the formula.
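The formula can also be checked numerically; a minimal sketch comparing the shortcut with brute-force LOOCV for a simple linear model (the simulated data are an assumption, and this is a check, not a proof):

set.seed(1)
n = 50
x = runif(n, 0, 10)
y = 1 + 2*x + rnorm(n, 0, 1)
X = cbind(1, x)
fit = lm(y ~ x)
U = X %*% solve(t(X) %*% X) %*% t(X) # the hat matrix U(S)
shortcut = sum(((y - fitted(fit))/(1 - diag(U)))^2) # closed-form LOOCV
brute = sum(sapply(1:n, function(i){
  fit_i = lm(y ~ x, subset = -i) # refit leaving observation i out
  (y[i] - predict(fit_i, newdata = data.frame(x = x[i])))^2
}))
c(shortcut, brute) # the two numbers agree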

Another measure that we have discussed is the BIC (Bayesian Information Criterion), defined as BIC(S) = log-likelihood(S) - (|S|/2)*log(n). So it is understood that we will select the model that maximizes the AIC or BIC and minimizes Mallow's Cp.

An interesting fact is that for linear regression, Mallow's Cp and cross-validation often yield essentially the same results so one might as well use Mallow's Cp method in linear regression contexts. However, for more complex problems cross-validation is preferred.

BIC is almost identical to AIC except that it puts a more severe penalty on complexity; thus it tends to choose a smaller model than the other methods. The BIC also has a Bayesian interpretation. Let S = {S_1, ..., S_m} be m candidate models out of which we plan to select the best one. We assign a discrete uniform prior on the set S, that is, P(S_i) = 1/m, i = 1, 2, ..., m. It can be shown that the posterior probability of a model is approximately P(S_j | data) = exp(BIC(S_j))/sum_{k=1}^m exp(BIC(S_k)).

We have also discussed the prior probability, the Bayes theorem, and the posterior probability distribution, and highlighted the idea that the posterior is proportional to the product of the likelihood and the prior.

We have studied the above sections from the book All of Statistics by Larry Wasserman (Chapter 13). Our next lecture plan is (a) Forward Selection and Backward Selection of Features (b) Zheng-Loh Model Selection Method.

(Homework) All of Statistics: Chapter 13: Problem No. 6, 7 (data related)

(Homework) All of Statistics: Chapter 13: Problem No. 9, 10 (conceptual)


Machine Learning: Lecture - IX (August 10, 2018) (01:30 PM - 03:30 PM)

We started our discussion with two available formulas for Cp. In the previous lecture, we saw that Mallow's Cp = R_hat_tr(S) + 2|S|*sigma_hat^2. Another form of Cp, call it Cp', is written as Cp' = R_hat_tr(S)/sigma_hat^2 + 2|S| - n. Both Cp and Cp' are equivalent, as there is a one-to-one correspondence between the two formulas given by Cp = sigma_hat^2*(Cp' + n); so the model with the smallest Cp is the same as the model with the smallest Cp'.

After we got an understanding of the different selection criteria, we turned to selecting a model out of several (nested!) candidate models. We studied four different selection strategies. I am not writing them out again here, but the details are in Section 6.1 of ISLR: Best Subset Selection (Algorithm 6.1, Subsection 6.1.1) and Forward Stepwise Selection (Algorithm 6.2, Sections 6.1.2 and 6.1.3). We have also discussed the hybrid approach and the motivation for the adjusted R^2.

(Homework) A short note on the Akaike Information Criterion and the Kullback-Leibler information criterion is uploaded at the following link: Click on A Short Note On AIC. Solve the three problems given in the note.

In the following, we compute the K-L information by using the integrate() function from R. Check the assignment in the above short note on AIC. For numerical integration in Python, one can use the quad() function from the scipy.integrate module.

par(mfrow=c(1,1))

alpha = 4

beta = 4

fun_gamma = function(x){

(1/beta^alpha)*exp(-x/beta)*x^(alpha-1)/gamma(alpha)

}

curve(fun_gamma(x), 0, 70, col = 1, ylab = "density", lwd=2, ylim=c(0,0.1))


alpha_w = 2 # note the change in parameter naming. Do not use alpha and beta again

beta_w = 20

fun_Weibull = function(x){

(alpha_w/beta_w)*(x/beta_w)^(alpha_w-1)*exp(-(x/beta_w)^alpha_w)

}

curve(fun_Weibull(x), 0, 70, col = 2, ylab = "density", add = TRUE, lwd=2, lty=2)

legend("topright", legend = c("gamma (4,4)", "Weibull (2,20)"),

bty= "n", col = c(1,2), lwd=c(2,2), lty = c(1,2))

fun_KL = function(x){

fun_gamma(x)*log(fun_gamma(x)/fun_Weibull(x))

}

integrate(fun_KL, 0, 100)

curve(fun_KL(x), 0, 100)

Machine Learning: Lecture - X (August 20, 2018) (01:30 PM - 03:30 PM)

Today, we mainly discussed various facts related to the AIC and its connection to the likelihood, along with a historical development of the AIC. Most of this is covered in the Short Note on AIC uploaded with the previous day's lecture; in particular, we discussed the Kullback-Leibler information and its connection to the AIC. In the following, I am giving the R code to compute the AIC. We use the definition of the book All of Statistics, so that AIC = log(maximum likelihood) - number of parameters. Let us compute the AIC to choose the optimal order of the polynomial; this is the same problem that we discussed in connection with the estimation of prediction error. At the end, we also compare the performance of AIC and K-fold cross validation. The following codes were mainly developed by my student Dipali Mestry, so all credit goes to her for today's updates. We simulated the response variable (y) using the function f(x) = 0.1 + 2*sqrt(x) - sin(x), where x is chosen from the interval (1,5), and considered the linear regression setup with normally distributed errors. We have to look for the model that has the maximum AIC. Ideally the following code should be run on a training sample; however, the complete data has been used here, as the code is only for demonstration purposes.

Check the behavior of the data carefully. Although we have used a nonlinear function, the behavior of the data is essentially close to linear.

library(polynom)

par(mfrow=c(1,2))

sigma = 0.3 # population error specification

n = 100

x = seq(1,5,length.out = n) # predictor variable

y = 0.1 + 2*sqrt(x) -sin(x)+rnorm(n,0,sigma) # simulated response

data=data.frame(x,y) # store predictor and response in data.frame

plot(x,y,col="red") # look at the plot

degree = 1:10 # order of the polynomials to be tested

AIC=numeric(length(degree)) # allocation for computed AIC values

for(d in degree){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data =data)

sigma_hat_sq=sum((fit$residuals)^2)/n

fit_poly = polynomial(coef = coef(fit))

f=as.function(fit_poly) # creating function object

curve(f, col = d, add=TRUE) # add the fitted function on the plot

# computation of likelihood for AIC computation.

likelihood=prod((1/(sqrt(2*pi)*sqrt(sigma_hat_sq)))*exp(-(1/(2*(sigma_hat_sq)))*(y-f(x))^2))

s = d + 1 + 1 # coefficients + intercept + sigma

AIC[d]=log(likelihood)-s # As per the definition of All of Statistics

}

AIC

max(AIC)

plot(degree,AIC,col="red",main="using AIC", lwd=2)

ind = which.max(AIC)

points(degree[ind], AIC[ind], pch = "*", lwd=2, cex= 3, col ="blue")

In the following, we compare the above problem in the light of train-test validation. We check whether train test validation also gives the same model as obtained by AIC values. We consider the training proportion as 0.7 and the process has been carried out rep = 100 times to compute the average prediction error.

n=100

x=seq(1,5,length.out = n)


sigma = 0.3# population error standard deviation

y=0.1 + 2*sqrt(x) -sin(x)+rnorm(n,0,sigma)


# Lets store them in a data frame

data = data.frame(x,y)


rep = 100

degree = 1:10

train_prop = 0.7


training_mse = matrix(data = NA, nrow = rep, ncol = length(degree))

test_mse = matrix(data = NA, nrow = rep, ncol = length(degree))


training_bias = matrix(data = NA, nrow = rep, ncol = length(degree))

test_bias = matrix(data = NA, nrow = rep, ncol = length(degree))


library(ISLR)

for(i in 1:rep){

train_index = sample(1:length(x), size = train_prop*length(x), replace = FALSE)

train_data = data[train_index, ]

test_data = data[-train_index,]

for(d in degree){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

fit_poly = polynomial(coef = coef(fit))

fun_poly = as.function(fit_poly)

# computation of training mse

y_hat = predict(fit, newdata = train_data)

training_mse[i, d] = mean((y_hat - train_data$y)^2)

# computation of test mse

y_pred = predict(fit, newdata = test_data)

test_mse[i, d] = mean((y_pred - test_data$y)^2)

}

}


avg_training_mse = colMeans(training_mse)

avg_test_mse = colMeans(test_mse)

y_lim = c(min(avg_training_mse, avg_test_mse), max(avg_training_mse, avg_test_mse))

plot(degree, avg_training_mse, col = "red", lwd=2,

type = "l", ylim = y_lim,main="k-fold")

lines(degree, avg_test_mse, col = "blue", lwd=2)

legend("topleft", c("train MSE", "test MSE"), col = c("red","blue"), lwd=c(2,2), bty="n")

points(3, min(avg_test_mse), col="black", lwd = 3)

Non-parametric bootstrap method can be used in the estimation of model selection frequencies (pi_i) and it estimates of precision that include model selection uncertainty. The fundamental idea of the model based sampling theory approach to statistical inference is that the data arise as a sample from some conceptual probability distribution f. Bootstrap method allows the computation of measures of our inference uncertainty by having a simple empirical estimate of f and sampling from this estimated distribution. More discussion we will do in the later lectures. The following contains the algorithm to estimate the model selection frequencies using Bootstrap; this is essentially gives an empirical distribution depicting the likelihood of the models over the model space.

B_dataset = matrix(NA,nrow = n,ncol = 2)

degree = 1:10

AIC = numeric(length(degree))

max_AIC_model_no=numeric(B)

for(j in 1:B){

index = sample(1:length(x), size = n, replace = TRUE)

B_dataset = data[index,]

new_data = data.frame(B_dataset[,1],B_dataset[,2])

for(d in degree){

fit = lm(new_data[,2] ~ poly(new_data[,1], degree = d, raw = TRUE), data =new_data)

sigma_hat_sq=sum((fit$residuals)^2)/n

fit_poly = polynomial(coef = coef(fit))

f=as.function(fit_poly)

likeliood=prod((1/(sqrt(2*pi)*sqrt(sigma_hat_sq)))*exp(-(1/(2*(sigma_hat_sq)))*(new_data[,2]-f(new_data[,1]))^2))

s=d+1

AIC[d]=log(likeliood)-s

}

AIC

t = which.max(AIC)

max_AIC_model_no[j]=t

}

prob_model_selection=numeric(length(degree))

for(i in degree){

t=match(max_AIC_model_no,i)

n1=length(t)-length(which(is.na(t)))

prob_model_selection[i]=n1/B

}

prob_model_selection

which.max(prob_model_selection)

barplot(prob_model_selection,names.arg = as.character(degree))

As we can see that the polynomial of degree three has the highest frequency. This is in agreement with the models selected by the AIC and train test validation.

Machine Learning: Lecture - XI (August 24, 2018) (01:30 PM - 03:30 PM)

Machine Learning: Lecture - XII (August 27, 2018) (01:30 PM - 03:30 PM)

The above two days lectures are updated in the following material. The material also includes tutorial problems. Click the link: Introduction to Bayes Estimation

Mid Semester Examination (September 10, 2018): Question Paper


Machine Learning: Lecture - XVI (August 27, 2018) (01:30 PM - 03:30 PM)

Machine Learning: Lecture - XVII (October 5, 2018) (01:30 PM - 03:30 PM)

We have continued the discussion on Univariate Delta Method and continued to the Multivariate Delta Method. We have considered examples of ratio estimator to demonstrate the use of delta method for two variable case. Univariate delta method was demonstrated using the following R code. A pdf containing the details of Delta method would be updated. But, we mainly discussed the topics from Page 240, Section 5.5.4, Casella and Berger (2002) and Section 9.9 from Chapter 9, Wasserman (2005).

Demonstration of Delta Method:

n = 1000 # Size of the sample, consider varying it and investigating the histogram

lambda = 2

x = rpois(n, lambda = lambda)

hist(x, probability = TRUE)


rep = 10000 # Required to visualize the distribution

sample_mean = numeric(rep)

vals = matrix(data = NA, nrow = rep, ncol= n)

for(i in 1:rep){

vals[i,] = rpois(n, lambda = lambda)

}

sample_mean = rowMeans(vals)

hist(sample_mean, probability = TRUE, col = "lightgrey", breaks = 70)

hist(sqrt(n)*(sample_mean-lambda),

probability = TRUE, col = "grey",main = bquote(sqrt(n)(bar(X)-lambda)))

#Validity of the central limit theorem

curve(dnorm(x, mean = 0, sd = sqrt(lambda)),

add = TRUE, col = "red", lwd=2)



# Delta method

hist(1/sample_mean, probability = TRUE, main = bquote(1/bar(X)), col = "grey")

points(1/lambda, 0, pch = 19, lwd=2, cex=2,col= "red")

mean(1/sample_mean)


hist(sqrt(n)*(1/sample_mean-1/lambda), probability = TRUE,

col = "grey",main = bquote(sqrt(n)(1/bar(X) - 1/lambda)))

curve(dnorm(x, mean = 0, sd = sqrt(1/lambda^3)), col = "red", lwd=2, add = TRUE)


mean(1/sample_mean^3)

1/lambda^3

Machine Learning: Lecture - XVIII (October 8, 2018) (01:30 PM - 03:30 PM

L_1 and L_2 regularization. Ridge and Lasso Regression. Connection to Bayesian.


Machine Learning: Lecture - XIX (October 12, 2018) (01:30 PM - 03:30 PM)

Machine Learning: Lecture - XX (October 15, 2018) (01:30 PM - 03:30 PM)

The following topics are covered in the above two lectures

  • Introduction to Multivariate Statistics: Mean vector, Variance Covariance Matrix, Standard deviation matrix, Correlation matrix. Ref: Chapter 3, Applied Multivariate Statistical Analysis by Johnson and Wichern.

  • Principal Component Analysis (PCA): Concepts of Population Principal Components, Maximization of the quadratic forms subject to constraints, Spectral decomposition, Computation of Principal components, Related problems, Geometry of the principal directions. Ref: Chapter 8, Applied Multivariate Statistical Analysis by Johnson and Wichern. Some notes related to matrix algebra in connection to PCA is attached here.

  • Principal Component Regression (PCR): Ref: Chapter 6, ISLR, Section: 6.3: Dimension Reduction Methods, Section 6.3.1: PCR.

The number of components can be decided by using Cross validation in the PCR. Useful R functions are pcr, validationplot from the library pls.

Machine Learning: Lecture - XXI (October 19, 2018) (01:30 PM - 03:30 PM)

Factor Analysis: Chapter 9, Applied Multivariate Statistical Analysis by Johnson and Wichern.

Estimation of parameters in Factor Model: Principal Component Factor Analysis

Multivariate Normal Distribution: Chapter 4, Applied Multivariate Statistical Analysis by Johnson and Wichern.

Machine Learning: Lecture - XXII (October 22, 2018) (01:30 PM - 03:30 PM)

Chapter 7. Moving Beyond Linearity: (Polynomial Regression, Piecewise Constant Regression Models, Basis Function Approach, Piecewise Polynomials, Spline basis representation, Constraints and Splines, Introduction to Generalized Additive Models


Machine Learning: Lecture - XXIII (October 25, 2018) (01:30 PM - 03:30 PM)

Chapter 7: ISLR - Spline Basis Representation for fitting cubic splines, How to chose the number of knots and location of knots, idea of smoothing splines and the choice of the tuning parameter, The idea of local regression and its computation, some further discussion on generalized linear models and smoothing splines.


The following R code was used to generate these figures. The corresponding Python programs to generate similar figures were written by Fatima Razi Ahmed Iffat (our M.Sc. student). The Python programs have been put on the ML Python page.

The following code illustrates the behavior of the expected training and test MSE when the true relationship between the response and the predictor is moderately nonlinear.

par(mfrow=c(1,2))

x = seq(1, 100, length.out = 100)

b_0 = 1

b_1 = 1

sigma = 0.5

y = b_0 + b_1 *sqrt(x) + rnorm(length(x), 0, sigma)

plot(x,y, col = "red")

curve(b_0 + b_1*sqrt(x), lwd=2, col = "black", add = TRUE)

data = data.frame(x,y)

deg = 1:10

library(ISLR)

library(polynom)

rep = 1000

train_mse = matrix(NA, nrow = rep, ncol = length(deg))

test_mse = matrix(NA, nrow = rep, ncol = length(deg))

for(i in 1:rep){

train = sample(1:length(x), size = floor(0.6*length(x)) )

train_data = data[train,]

test_data = data[-train,]

for(d in deg){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

pred_y = fitted.values(fit)

if(i==1){

p = polynomial(coef = coef(fit))

fun = as.function(p)

curve(fun(x), lwd=2, col = d+1, add = TRUE)

}

train_mse[i,d] = mean((pred_y-train_data$y)^2)

pred_y_test = predict(fit, newdata = test_data)

test_mse[i,d] = mean((pred_y_test - test_data$y)^2)

}

}

plot(deg, colMeans(train_mse), type = "l", col = "red", lwd=2, ylim = c(0.1,0.5), xlab = "Degree of Flexibility", ylab = "MSE")

lines(deg, colMeans(test_mse), type = "l" , col = "blue", lwd=2)

legend("topright", c("train MSE", "test MSE"), col = c("red", "blue"), lwd=c(2,2), bty= "n")

abline(h = sigma^2, lwd=2, col = "grey", lty=4)


The following code illustrates the behavior of the expected training and test MSE when the true relationship between the response and the predictor is linear.

x = seq(1, 100, length.out = 100)

b_0 = 1

b_1 = 0.5

sigma = 4

y = b_0 + b_1 *x + rnorm(length(x), 0, sigma)

plot(x,y, col = "red")

curve(b_0 + b_1*x, lwd=2, col = "black", add = TRUE)

data = data.frame(x,y)

deg = 1:10

library(ISLR)

library(polynom)

rep = 500

train_mse = matrix(NA, nrow = rep, ncol = length(deg))

test_mse = matrix(NA, nrow = rep, ncol = length(deg))

for(i in 1:rep){

train = sample(1:length(x), size = floor(0.7*length(x)) )

train_data = data[train,]

test_data = data[-train,]

for(d in deg){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

pred_y = fitted.values(fit)

if(i==1){

p = polynomial(coef = coef(fit))

fun = as.function(p)

curve(fun(x), lwd=2, col = d+1, add = TRUE)

}

train_mse[i,d] = mean((pred_y-train_data$y)^2)

pred_y_test = predict(fit, newdata = test_data)

test_mse[i,d] = mean((pred_y_test - test_data$y)^2)

}

}

plot(deg, colMeans(train_mse), type = "l", col = "red", lwd=2, ylim = c(10, 30), xlab = "Degree of Flexibility", ylab = "MSE")

lines(deg, colMeans(test_mse), type = "l" , col = "blue", lwd=2)

legend("topright", c("train MSE", "test MSE"), col = c("red", "blue"), lwd=c(2,2), bty= "n")

abline(h = sigma^2, lwd=2, col = "grey", lty=4)

The following code illustrates the behavior of the expected training and test MSE when the true relationship between the response and the predictor is highly nonlinear.

x = seq(1, 100, length.out = 100)

b_0 = 1

b_1 = 1

b_2 = 0.1

sigma = 0.01

y = b_0 - b_1 *sin(x/60) + b_2*sqrt(x) + rnorm(length(x), 0, sigma)

plot(x,y, col = "red")

curve(b_0 - b_1 *sin(x/60) + b_2*sqrt(x), lwd=2, col = "black", add = TRUE)

data = data.frame(x,y)

deg = 1:10

library(ISLR)

library(polynom)

rep = 100

train_mse = matrix(NA, nrow = rep, ncol = length(deg))

test_mse = matrix(NA, nrow = rep, ncol = length(deg))

for(i in 1:rep){

train = sample(1:length(x), size = floor(0.8*length(x)) )

train_data = data[train,]

test_data = data[-train,]

for(d in deg){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

pred_y = fitted.values(fit)

if(i==1){

p = polynomial(coef = coef(fit))

fun = as.function(p)

curve(fun(x), lwd=2, col = d+1, add = TRUE)

}

train_mse[i,d] = mean((pred_y-train_data$y)^2)

pred_y_test = predict(fit, newdata = test_data)

test_mse[i,d] = mean((pred_y_test - test_data$y)^2)

}

}

plot(deg, colMeans(train_mse), type = "l", col = "red", lwd=2, xlab = "Degree of Flexibility", ylab = "MSE")

lines(deg, colMeans(test_mse), type = "l" , col = "blue", lwd=2)

legend("topright", c("train MSE", "test MSE"),

col = c("red", "blue"), lwd=c(2,2), bty= "n")

abline(h = sigma^2, lwd=2, col = "grey", lty=4)

Note the use of the polynom package from R to construct the polynomial using the estimated coefficients. We cannot directly plot the polynomial object using the curve() function, because curve() expects a function (or an expression in x) as its argument. The polynomial() function returns an object of class polynomial, and as.function() converts that polynomial object into a function object which can then be passed to curve() for plotting. It is super easy and exciting too. The package also provides tools related to orthogonal polynomials; we will explore it at a later time. The functions whose names start with as. play an important role in converting between data types in R. Some common ones are as.data.frame, as.matrix, as.vector, as.function, as.list, as.factor and many others. This makes conversion among objects very user friendly.

Another note: the predict() function expects newdata to be a data.frame whose column names match the variables used in the fitted model. If the data are stored in a matrix, wrap them in as.data.frame() first.
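A tiny sketch illustrating both notes (the numbers, variable names and the quadratic fit below are only for illustration):

library(polynom)

x = seq(0, 10, length.out = 50)

y = 1 + 2*x + rnorm(50, 0, 1)

fit = lm(y ~ poly(x, 2, raw = TRUE))

p = polynomial(coef = coef(fit)) # a polynomial object, not yet a function

f = as.function(p) # converted to a function object

curve(f(x), 0, 10) # curve() accepts it after the conversion

new_x = matrix(seq(0, 10, length.out = 5), ncol = 1) # new predictor values stored in a matrix

colnames(new_x) = "x"

predict(fit, newdata = as.data.frame(new_x)) # predict() needs a data.frame with a matching column name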


Machine Learning: Lecture - V (July 27, 2018) (01:30 PM - 03:30 PM)

Many students are having difficulties in understanding the computation part, in particular the difference between the training MSE and the expected training MSE. Ideally, if we perform a single train-test validation, we obtain the sample test MSE and the sample training MSE; repeating this process a large number of times, we may obtain the expected training and expected test MSE. So, in today's lecture, we will code the process explicitly and explain each of the steps. I shall try to unfold the previous day's lecture by explaining each and every part of the program explicitly. You are requested to follow the process in Python and obtain the universal pattern of training and test MSE. Along with that, we shall also discuss the concepts of training bias and test bias. I shall explain the process by assuming that the population relationship is linear in nature. We shall estimate the MSE and bias for both the training and the test set. We shall fit polynomials of various degrees (the degree is the measure of flexibility) and obtain their estimates based on a large number of training and test set partitions. The program is done step by step. You are requested to follow the code step by step and write the algorithm on your own on a piece of paper.

set.seed(123)

# input variable specification

x = seq(1, 10, length.out = 100)

# fixation of the population parameter

b_0 = 1 # intercept

b_1 = 2 # slope

sigma = 1 # population error standard deviation

y = b_0 + b_1 * x + rnorm(length(x), 0, sigma)


# Lets store them in a data frame

data = data.frame(x,y)

plot(x,y, col = "red")


# Now we do not know the population function f(x). We plan to estimate it using some f_hat

fit_1 = lm(y ~ x, data = data)


training_mse = mean((fit_1$residuals)^2)

print(training_mse)

abline(fit_1, col = "blue")


y_hat = coef(fit_1)[1] + coef(fit_1)[2]*x

resids = y-y_hat

training_mse = mean(resids^2)

print(training_mse)


We used the complete data for training the model, so we do not have any data left to test the predictive performance of the model. So let us separate the data into two parts.

# training set and test set

View(data)

train_prop = 0.8

train_index = sample(1:length(x), size = train_prop*length(x), replace = FALSE)

print(train_index)

train_data = data[train_index, ]

dim(train_data)

test_data = data[-train_index,]

dim(test_data)

fit_1 = lm(y ~ x, data = train_data)

# computation of training mse

y_hat = coef(fit_1)[1] + coef(fit_1)[2]*train_data$x # predictions on the training set only

resids = train_data$y - y_hat

training_mse = mean(resids^2)

print(training_mse)

# computation of test mse

y_pred = coef(fit_1)[1] + coef(fit_1)[2]*test_data$x

test_y = test_data$y

test_resid = y_pred - test_y

test_mse = mean(test_resid^2)

print(test_mse)


plot(x,y , col = "red")

points(test_data$x, test_data$y, col = "blue", lwd=2)

points(test_data$x, y_pred, col = "magenta", pch = "*", cex=2)

legend("topleft", c("train data", "test data", "y pred"), col= c("red", "blue", "magenta"), lwd=c(1,2,1))

# lets check for higher order polynomial

# I() is a wrapper that creates the variable x^2 and treats it as a separate input

fit_2 = lm(y ~ x + I(x^2), data = train_data)

resids = residuals(fit_2)

training_mse = mean(resids^2)

print(training_mse)


plot(x,y, col = "red")

curve(coef(fit_2)[1] + coef(fit_2)[2]*x + coef(fit_2)[3]*x^2,

col = "blue", add =TRUE)


# computation of test mse

y_pred = coef(fit_2)[1] + coef(fit_2)[2]*test_data$x + coef(fit_2)[3]*test_data$x^2

test_y = test_data$y

test_resid = y_pred - test_y

test_mse = mean(test_resid^2)

print(test_mse)


# Computation of third order polynomials

fit_3 = lm(y ~ x + I(x^2) + I(x^3), data = train_data)

resids = residuals(fit_3)

training_mse = mean(resids^2)

print(training_mse)


plot(x,y, col = "red")

curve(coef(fit_3)[1] + coef(fit_3)[2]*x + coef(fit_3)[3]*x^2 + coef(fit_3)[4]*x^3,

col = "blue", add =TRUE)


# computation of test mse

y_pred = coef(fit_3)[1] + coef(fit_3)[2]*test_data$x + coef(fit_3)[3]*test_data$x^2 + coef(fit_3)[4]*test_data$x^3

test_y = test_data$y

test_resid = y_pred - test_y

test_mse = mean(test_resid^2)

print(test_mse)


library(ISLR)

fit_5 = lm(y ~ poly(x, 5, raw = TRUE), data = train_data)

summary(fit_5)

coef(fit_5)

plot(x,y, col = "red" )

library(polynom)

estimate_f = polynomial(coef = coef(fit_5))

print(estimate_f)

curve(estimate_f(x), -1, 1) # It does not work

class(estimate_f) # Note that it is not a function object, rather it is a polynomial object

estimate_f = as.function(estimate_f)

class(estimate_f)

curve(estimate_f(x), 0, 10, col = "blue", add =TRUE, lwd=2)


Similarly, we can go for higher orders of the polynomial; each time we compute the training and test MSE and plot them against the degree of the polynomial. We should choose the degree of the polynomial which has the smallest test MSE.

Merely doing the above procedure will not guarantee the selection of the optimal degree, because both the train and test MSEs are estimates based on a single train-test validation. To obtain an accurate estimate of the expected train and test MSE, we have to carry out the above process several times, compute the average of all the train and test MSEs, and plot these averages against the degree of the polynomial. Then choose the optimal level of flexibility based on the graph.

# input variable specification

x = seq(1, 10, length.out = 100)

# fixation of the population parameter

b_0 = 1 # intercept

b_1 = 2 # slope

sigma = 5 # population error standard deviation

y = b_0 + b_1 * x + rnorm(length(x), 0, sigma)


# Lets store them in a data frame

data = data.frame(x,y)

plot(x,y, col = "red")


rep = 100

degree = 1:10

train_prop = 0.7


training_mse = matrix(data = NA, nrow = rep, ncol = length(degree))

test_mse = matrix(data = NA, nrow = rep, ncol = length(degree))


training_bias = matrix(data = NA, nrow = rep, ncol = length(degree))

test_bias = matrix(data = NA, nrow = rep, ncol = length(degree))



plot(x,y, col = "red")

library(polynom)

library(ISLR)

for(i in 1:rep){

train_index = sample(1:length(x), size = train_prop*length(x), replace = FALSE)

train_data = data[train_index, ]

test_data = data[-train_index,]

for(d in degree){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

fit_poly = polynomial(coef = coef(fit))

fun_poly = as.function(fit_poly)

if(i == 1){

curve(fun_poly(x), add = TRUE, col = d)

}

# computation of training mse

y_hat = predict(fit, newdata = train_data)

training_mse[i, d] = mean((y_hat - train_data$y)^2)

# computation of test mse

y_pred = predict(fit, newdata = test_data)

test_mse[i, d] = mean((y_pred - test_data$y)^2)

# Computing training bias

training_bias[i,d] = mean(y_hat - b_0 - b_1*train_data$x)^2

# Computing test bias

test_bias[i,d] = mean(y_pred - b_0 - b_1*test_data$x)^2

}

}


avg_training_mse = colMeans(training_mse)

avg_test_mse = colMeans(test_mse)

avg_training_bias = colMeans(training_bias)

avg_test_bias = colMeans(test_bias)


y_lim = c(min(avg_training_mse, avg_test_mse), max(avg_training_mse, avg_test_mse))

plot(degree, avg_training_mse, col = "red", lwd=2,

type = "l", ylim = y_lim)

lines(degree, avg_test_mse, col = "blue", lwd=2)

legend("topleft", c("train MSE", "test MSE"), col = c("red","blue"), lwd=c(2,2))


y_lim = c(min(avg_training_bias, avg_test_bias), max(avg_training_bias, avg_test_bias))

plot(degree, avg_training_bias, col = "red", type = "l", ylim = y_lim, lwd=2)

lines(degree, avg_test_bias, col = "blue", type = "l", lwd=2)

legend("topleft", c("train bias", "test bias"), col = c("red","blue"), lwd=c(2,2))


(Homework) Replicate the above work in Python. Consider different levels of non-linearity in the population relationship. Linear, moderately nonlinear and highly nonlinear. Investigate the patterns of expected training mse, expected test mse and expected training bias and expected test bias.

Machine Learning: Lecture - VI (July 30, 2018) (01:30 PM - 03:30 PM)

Today we have mainly discussed three topics: K-nearest neighbour regression, the variance inflation factor, and an overview of resampling methods. First we discuss the K-nearest neighbour method. In the following code we have used the function knnreg from the caret package to perform KNN regression. This is a sample code to show how increasing the value of K changes the smoothness of the predictive curve. Here I did not include how the optimal choice of K is to be decided; I have put that as an exercise, because it requires the same idea of train-test validation (carried out several times). This code will help you to do the homework at the end of this session.

Local methods in Regression: K-nearest neighbour regression

x = seq(1, 10, length.out = 50)

b_0 = 1

b_1 = 0.3

sigma = 0.5

y = b_0 + b_1*x + rnorm(length(x), 0, sigma)

plot(x,y, col = "red")

data = data.frame(x,y)


library(caret)

fit = knnreg(as.data.frame(x),y, k=1)

summary(fit)

par(mfrow=c(2,3))

plot(x,y, col = "red", main = "k=1")

fit_1 = knnreg(as.data.frame(data$x), data$y, k=1)

y_hat = predict(fit_1, as.data.frame(data$x))

lines(x, y_hat, col = "grey", lty=1)

Note the edge effect. If k = 2, the two nearest values (including the current observation) are averaged; but when an observation lies at the edge, its neighbours can come only from one side, which affects the fit near the boundary.

plot(x,y, col = "red", main = "k=2")

fit_2 = knnreg(as.data.frame(data$x), data$y, k=2)

y_hat = predict(fit_2, as.data.frame(data$x))

lines(x, y_hat, col = "magenta", lty=2)


plot(x,y, col = "red", main = "k=3")

fit_3 = knnreg(as.data.frame(data$x), data$y, k=3)

y_hat = predict(fit_3, as.data.frame(data$x))

lines(x, y_hat, col = "green", lty=3)


plot(x,y, col = "red", main = "k=4")

fit_4 = knnreg(as.data.frame(data$x), data$y, k=4)

y_hat = predict(fit_4, as.data.frame(data$x))

lines(x, y_hat, col = "blue", lty=4)


plot(x,y, col = "red", main = "k=5")

fit_5 = knnreg(as.data.frame(data$x), data$y, k=5)

y_hat = predict(fit_5, as.data.frame(data$x))

lines(x, y_hat, col = "red", lty=5)


plot(x,y, col = "red", main = "k=10")

fit_10 = knnreg(as.data.frame(data$x), data$y, k=10)

y_hat = predict(fit_10, as.data.frame(data$x))

lines(x, y_hat, col = "red", lty=10)

One can also take the population relationship between the response and the predictor to be nonlinear and check the performance of the K-nearest neighbour method in that case (see the homework towards the end of this section).

(Homework) As we observed, if we increase the value of K the predicted curve becomes smoother. With a small value of K, the estimated curve stays closer to the observed y values. This might indicate overfitting, but essentially we will have a lower bias. As we increase the value of K, the (training) bias will increase. So the question is how we can choose an optimal value of K. The homework is to write a program to select an optimal value of K. Perform train-test validation multiple times and obtain the estimates of training and test MSE. Plot them against 1/K. Select the optimal level of flexibility based on the lowest test MSE. Mark the optimal value of K using some big dot points.

The following code is essentially the solution of the problem, but I suggest you write the code on your own and do the experiment in Python. Note the use of the function sort(): it ensures the fitted values are plotted against an ordered x, so that multiple regression outputs can be drawn cleanly in a single plot. You may experiment without the sort() and check how the plot becomes wiggly.

# Experiment on the flexibility with linear true relationship. Do the same process using a nonlinear function.

x = seq(1, 10, length.out = 50)

b_0 = 1

b_1 = 0.3

sigma = 0.5

y = b_0 + b_1*x + rnorm(length(x), 0, sigma)

data = data.frame(x,y) # rebuild the data frame with the newly simulated y (the loop below uses 'data')

par(mfrow=c(1,2))

plot(x,y, col = "red")


train_prop = 0.8 # training proportion (this is the name used inside the loop below)


k_vals = 1:10

rep = 50

training_mse_knn = matrix(data = NA, nrow = rep, ncol = length(k_vals))

test_mse_knn = matrix(data = NA, nrow = rep, ncol = length(k_vals))


for(i in 1:rep){

train_index = sample(1:length(x), size = train_prop*length(x), replace = FALSE)

train_data = data[sort(train_index), ]

test_data = data[-sort(train_index),]

for(k in k_vals){

fit_knn = knnreg(as.data.frame(train_data$x), train_data$y, k=k)


# Computation of training mse

y_hat = predict(fit_knn, as.data.frame(train_data$x))

training_mse_knn[i, k] = mean((y_hat - train_data$y)^2)

# The following code is just used for plotting purpose, similar to what we have done for different orders of polynomials

if(i==1){

lines(train_data$x, y_hat, col = k, lty=k)

}

# Computation of test mse

y_pred = predict(fit_knn, as.data.frame(test_data$x))

test_mse_knn[i, k] = mean((y_pred - test_data$y)^2)

}

}

avg_training_mse_knn = colMeans(training_mse_knn)

avg_test_mse_knn = colMeans(test_mse_knn)

y_lim = c(min(avg_training_mse_knn, avg_test_mse_knn),max(avg_training_mse_knn, avg_test_mse_knn))

plot(1/k_vals, avg_training_mse_knn, col = "red", ylim = y_lim, type = "b", ylab = "MSE", lwd=2)

lines(1/k_vals, avg_test_mse_knn, col = "blue", type = "b", lwd=2)

legend("bottomright", c("training mse", "test mse"), col = c("red", "blue"),lwd=c(2,2), bty = "n")

(Homework) In the above example, we have assumed that the population relationship is linear. Create a case study in which the population function f(X) is nonlinear; you may consider Y = b_0 + b_1*sqrt(X) + e. Then perform train-test validation to choose the optimal value of K. Compare the pattern of the test MSE with the situation where f(X) is linear.

(Homework) Read Section 2.5 from ESL (Local Methods in High Dimensions). The concepts are already discussed in the class. Read Section 3.5 from ISLR. Get a visual perception about the curse of dimensionality.

Caution! KNN is a non-parametric method, as we do not assume any mathematical function whose parameters need to be estimated from the training data. KNN performs slightly worse than linear regression when the relationship is linear, but much better than linear regression in non-linear situations. Do the above homework and the conclusions will become clear to you. Hence, "in a real-life situation in which the true relationship is unknown, one might draw the conclusion that KNN should be favored over linear regression because it will at worst be slightly inferior to linear regression if the true relationship is linear, and may give substantially better results if the true relationship is non-linear. But that is not always the case, because in higher dimensions KNN often performs worse than linear regression" (James et al., ISLR).

In the next session, we discussed the multicollinearity problem. In previous lectures, we raised this issue and explained that if two or more predictors are linearly related, then (X'X) becomes singular or nearly singular, and hence the uncertainty regarding the parameter estimates increases. When two predictors are linearly related, this can be identified from pairs() plots or by looking at the correlation matrix of the predictors: two predictors with high correlation may give an indication of multicollinearity. However, the problem becomes severe when a particular predictor can be written as a linear combination of several others, which is difficult or impossible to identify from pairwise scatter plots. In such a situation, we look at the variance inflation factor (VIF). It is computed as follows. Suppose we have p predictors X_1, X_2, ..., X_p. We regress X_1 on X_2, ..., X_p by a linear model and compute the coefficient of determination, call it R^2_{1|-1}. We do the same for X_2 with X_1, X_3, ..., X_p as predictors, and so on, recording the coefficient of determination each time, so that there are p R-square values. A high value of R^2_{j|-j} (j = 1, 2, ..., p) is a strong indication of a linear relationship among the predictors. The variance inflation factor is VIF(beta_hat_j) = 1/(1 - R^2_{j|-j}) for j = 1, 2, ..., p. A high VIF value indicates which variable is responsible for inflating the variance, that is, the source of multicollinearity.
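A bare-bones illustration of the computation described above (this is only a sketch with simulated predictors; the homework below asks for a fuller case study):

set.seed(1)

n = 100

x1 = rnorm(n)

x2 = rnorm(n)

x3 = 2*x1 - x2 + rnorm(n, 0, 0.1) # x3 is (nearly) a linear combination of x1 and x2

X = data.frame(x1, x2, x3)

vif_manual = numeric(ncol(X))

for(j in 1:ncol(X)){

r2 = summary(lm(X[[j]] ~ ., data = X[, -j, drop = FALSE]))$r.squared # R-square of X_j regressed on the others

vif_manual[j] = 1/(1 - r2)

}

vif_manual # large values flag the variables involved in the (near) linear dependence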

(Homework) Create a case study to demonstrate the impact of multicollinearity. Identify the source of multicollinearity by computing the VIF on your own. An R function for VIF is also available (for example, vif() from the car package); you can cross-check your results with it.

The third topic we discussed was leave-one-out cross validation and K-fold cross validation. Read Section 5.1 (Cross-Validation), subsections 5.1.1 - 5.1.4. Lecture material from the Software Lab is available with codes. A brief outline is given here:

[Leave One Out Cross Validation (LOOCV)]

Suppose that we have the dataset (x_i, y_i), i = 1, ..., n. By this method, the set of observations is divided into two parts: the test set contains only one observation (say (x_1, y_1)) and the training set contains the data (x_i, y_i), i = 2, ..., n. The model is fitted on the training set and the test MSE is computed by predicting the response y_1 in the test set; call it MSE_1. The process is continued for (x_2, y_2), ..., (x_n, y_n) and MSE_2, ..., MSE_n are computed respectively. The average test MSE is computed as

MSE =(MSE_1+MSE_2+ · · · +MSE_n)/n.

[K-Fold Cross Validation (K-Fold CV)]

The sample is divided into k non-overlapping sets of (roughly) equal size. For each of the k sets, the other k-1 folds are used for training and the remaining fold for testing. The mean squared error, MSE_1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as the validation set. This process results in k estimates of the test error, MSE_1, MSE_2, ..., MSE_k. The k-fold CV estimate of the test MSE is computed by averaging these values

MSE =(MSE_1+MSE_2+ · · · +MSE_k)/k

  • LOOCV is a special case of the k-fold cross validation, where k is chosen to be equal to the number of observations.

  • LOOCV is time consuming because of the number of model fits required when the dataset is large.

  • With k-fold cross-validation, the computation time is reduced because fewer model fits are required, while all the observations are still eventually used both to train and to test the model. (A minimal R sketch of K-fold CV is given below.)
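For concreteness, a minimal sketch of K-fold CV for the polynomial example (it assumes a data frame data with columns x and y, as in the simulations above; the fold count K = 5 is an arbitrary choice):

K = 5

degree = 1:10

folds = sample(rep(1:K, length.out = nrow(data))) # randomly assign each row to one of the K folds

cv_mse = matrix(NA, nrow = K, ncol = length(degree))

for(k in 1:K){

train_data = data[folds != k, ]

test_data = data[folds == k, ]

for(d in degree){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

y_pred = predict(fit, newdata = test_data)

cv_mse[k, d] = mean((y_pred - test_data$y)^2)

}

}

cv_estimate = colMeans(cv_mse) # K-fold CV estimate of the test MSE for each degree

plot(degree, cv_estimate, type = "b", col = "blue", lwd = 2, xlab = "Degree", ylab = "K-fold CV estimate of test MSE")

which.min(cv_estimate) # degree with the smallest estimated test MSE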

In the following code we check the behaviour of the test MSE under a single train-test validation. As we will see, the estimate of the test MSE varies considerably across different test sets. The reason is that the prediction heavily depends on which observations happen to fall in the test set and which in the training set; moreover, there is no guarantee that every observation is used for training or for testing. Hence, for some training sets the model might perform very well while its performance on the test set is not satisfactory, and vice versa. In such a situation, we need a resampling method that guarantees that each observation is included in the training set and also in the test set at some point.

Leave-one-out cross validation is one such method. In LOOCV we use the data point (x_1, y_1) as the test set and the remaining (n-1) observations are used to train the statistical learning algorithm. A similar process is carried out for (x_2, y_2), (x_3, y_3), and so on. So essentially we end up with n training and test MSEs, which are averaged to obtain the LOOCV estimates of the training and test MSE. The model with the smallest test MSE may be considered for prediction. However, note that we need to build the model n times, which is computationally expensive.

So we need a resampling technique that is computationally less expensive and simultaneously ensures that all the observations are used for both training and testing. This can be achieved using K-fold cross validation. Essentially, we separate the complete set of observations into K distinct subsets (say B_1, B_2, ..., B_K). We train the model using the subsets B_2, B_3, ..., B_K and test its performance on B_1; a similar process is carried out with every subset held out in turn. We end up with K training and test MSE values, which are averaged to obtain estimates of the expected training and expected test MSE. The model with the smallest test MSE can be considered for prediction. Note that here the number of times the training has to be carried out is small: if n = 100 and K = 10, then for LOOCV the algorithm needs to be run 100 times, whereas for K-fold cross validation it needs to be run only 10 times. It is a very powerful method and appears routinely in real data analyses, including those published in top-tier journals. In the following code, change the value of train_prop and check the behaviour of the training and test MSE. You may also experiment with different values of the population standard deviation sigma, and it is also suggested to vary the degree of non-linearity in the population relationship between x and y.

par(mfrow=c(1,1))

x = seq(1, 10, length.out = 100)

b_0 = 1

b_1 = 2 # slope

sigma = 0.8

y = b_0 + b_1*x + rnorm(length(x), 0, sigma)

plot(x,y,col = "red")

data = data.frame(x,y)

train_prop = 0.8

rep = 10

degree = 1:10 # polynomial degrees to be compared (used by the loop below)

training_mse = matrix(data = NA, nrow = rep, ncol = length(degree))

test_mse = matrix(data = NA, nrow = rep, ncol = length(degree))


for(i in 1:rep){

train_index = sample(1:length(x), size = train_prop*length(x), replace = FALSE)

train_data = data[train_index, ]

test_data = data[-train_index,]


for(d in degree){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

fit_poly = polynomial(coef = coef(fit))

fun_poly = as.function(fit_poly)

if(i == 1){

curve(fun_poly(x), add = TRUE, col = d)

}

# computation of training mse

y_hat = predict(fit, newdata = train_data)

training_mse[i, d] = mean((y_hat - train_data$y)^2)

# computation of test mse

y_pred = predict(fit, newdata = test_data)

test_mse[i, d] = mean((y_pred - test_data$y)^2)

}

}

matplot(test_mse, type = "l", lwd=2, xlab = "Flexibility")

matplot(training_mse, type = "l", lwd=2, xlab = "Flexibility")

Machine Learning: Lecture - VII (August 6, 2018) (01:30 PM - 03:30 PM)

In the previous lecture, we discussed and compared some resampling methods. These methods are basically used to estimate the expected training error and the expected test error rate. From today's lecture, we will concentrate more on the statistical properties of machine learning methods. Without any dilemma, we shall refer to the different techniques as machine learning methods instead of statistical methods, although most of the methods originated much earlier, before the term machine learning became popular.

Recall the concept of parametric methods, in which we assume some mathematical function relating the predictors and the response. The problem of understanding the data then boils down to estimating the parameters of the assumed function (model) efficiently. Several estimation procedures are available for estimating the parameters. Today, we start our discussion with two popular methods: (a) maximum likelihood estimation and (b) the method of least squares. Due to the limitations of mathematical notation on the Google web page, I am giving here the book names and pages which have been covered during the lecture. We started with a review of the method of maximum likelihood estimation by means of an example, where we estimated the rate parameter of the Poisson(lambda) distribution based on a sample of size n. We recalled the likelihood equation.

(Homework) Write the log-likelihood function for the Poisson(lambda) distribution based on a sample of size n. Use the function optim() from R to minimize the negative log-likelihood function. Check your result against the analytical calculation.
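A minimal sketch of the optim() part (only a starting point; the simulated data and the choice of bounds are illustrative, and the numerical answer should be compared with the analytical MLE, the sample mean):

set.seed(1)

x = rpois(100, lambda = 3) # simulated data with true lambda = 3

negloglik = function(lambda, x){

-sum(dpois(x, lambda = lambda, log = TRUE)) # negative log-likelihood of Poisson(lambda)

}

fit = optim(par = 1, fn = negloglik, x = x, method = "Brent", lower = 0.001, upper = 50)

fit$par # numerical MLE

mean(x) # analytical MLE for comparison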

We have studied the following things during the class: Book: All of Statistics by Larry Wasserman

  • Definition of the Simple Linear Regression Model (Section 13.1)

  • Derivation of Maximum Likelihood and Least Squares Estimators for Simple Linear Regression.

  • Theorem: Under the assumption of normality, the least-squares estimator is also the maximum likelihood estimator.

  • Properties of the Least Squares Estimators. (Theorem 13.8) (Homework)

  • We have also discussed the consistency and asymptotic normality of the LSE. We also derived the (1-alpha) confidence interval and the Wald test for the null hypothesis H_0: beta_1 = 0 versus H_1: beta_1 is not equal to 0.

  • Then we moved back to multiple linear regression. Since the derivations were already done, we did not derive the results here again, although we did discuss the unbiased estimator of sigma^2.

  • Then we discussed model selection. We defined the prediction risk, and we have seen that the training error is a downward-biased estimate of the prediction risk. The reason for this bias is that the data are being used twice: to estimate the parameters and to estimate the risk. The bias is -2*sum_i Cov(Y_hat_i, Y_i). Thus, when we fit a very complex model with many parameters, the covariance term will be large and the bias of the training error gets worse. We therefore need other measures to estimate the risk; in the next lecture, we shall discuss other measures of prediction risk.

(Homework) Consider the regression-through-the-origin model Y_i = beta*X_i + e_i. Find the least-squares estimate of beta. Find the standard error of the estimate. Find the conditions that guarantee that the estimate is consistent.

(Homework) Exercises 13.6, Problem numbers 6 and 7, from All of Statistics by Larry Wasserman. These problems are related to real data.


Machine Learning: Lecture - VIII (August 9, 2018) (10:30 AM - 12:30 PM)

In today's lecture, we shall explore different measures of prediction risk. In the previous lecture we have seen that the training error is a downward-biased estimate of the prediction risk R(S). Here S is a subset of {1, 2, ..., p}, where p is the number of predictors, Chi_S = {X_j : j belongs to S} is the corresponding set of covariates, beta_S denotes the coefficients of this set of predictors, and X_S denotes the design matrix associated with this subset. The prediction risk is R(S) = sum_{i=1}^n E(Y_hat_i(S) - Y_i^*)^2, where Y_i^* denotes the value of a future observation of Y_i at the covariate value X_i.

Now, to choose a model with an optimal level of complexity, we need to estimate the prediction error more accurately. Since the training error rate has a downward bias, it cannot be used as it stands, because it will always tend to select the model with the higher complexity level. Note that, in the present discussion, complexity is understood in the following sense: a linear regression model with a large number of predictors is more complex than a model with fewer predictors. As the model complexity increases, the training error rate R_hat_tr(S) decreases monotonically: if we keep adding covariates, the training error will keep decreasing. Note that R_hat_tr(S) essentially measures lack of fit and is insensitive to the increasing complexity (hence models chosen by it tend to perform worse on the test set). So we need to devise a measure that estimates R(S) (the prediction risk) while incorporating both the lack of fit and the complexity of the model. Mallow's Cp is an improved estimator of R(S), defined as R_hat(S) = R_hat_tr(S) + 2|S|sigma_hat^2, where |S| is the number of predictors in the set S and sigma_hat^2 is an estimate of sigma^2 obtained from the full model, that is, the model with all the predictors included. Thus Mallow's Cp = lack of fit + bias correction.
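As an illustration of the Cp computation in the polynomial setting (a sketch only: the simulated data are arbitrary, sigma_hat^2 is taken from the most complex model fitted here, and |S| is counted as the d slope terms plus the intercept):

set.seed(1)

n = 100

x = runif(n, 0, 10)

y = 1 + 0.5*x + rnorm(n, 0, 2)

dat = data.frame(x, y)

max_deg = 8

sigma_hat_sq = summary(lm(y ~ poly(x, max_deg, raw = TRUE), data = dat))$sigma^2 # sigma^2 estimated from the fullest model

Cp = numeric(max_deg)

for(d in 1:max_deg){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = dat)

R_tr = sum(residuals(fit)^2) # training error (residual sum of squares)

Cp[d] = R_tr + 2*(d + 1)*sigma_hat_sq # lack of fit + complexity penalty

}

which.min(Cp) # degree with the smallest Cp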

The above discussion essentially reflects the fact that finding a good model involves trading off fit against model complexity. Another important measure is AIC (Akaike Information Criterion), defined as AIC(S) = log-likelihood(S) - |S|; this is again of the form goodness of fit - complexity.

(Homework) Consider a model y = f(x) + e, with f(x) = 0.1 + 2*sqrt(x) - sin(x) and sigma = 0.3 (the population relation). Simulate a data set of size n = 100. Fit polynomials of varying degree (1:10) and in each case compute the AIC. Check whether AIC selects the same model as K-fold cross validation. In the class, we have already derived the expression for the log-likelihood and elaborated how the AIC needs to be computed.

(Homework) Assume a linear regression model with normal errors. Take sigma to be known. Show that the model with the highest AIC is the model with the lowest Mallow's Cp statistic.

We have also discussed the LOOCV and K-fold CV estimators of the prediction risk. Refer to the previous lecture for an elaborate explanation of these concepts.

(Homework) In the previous lecture, we have seen that if a data set contains n rows, then to compute the LOOCV error the model needs to be built n times, which may not be computationally feasible. However, this laborious job can be shortened by using the fact that R_hat_LOOCV(S) = sum_{i=1}^n (Y_i - Y_hat_i(S))^2/(1 - U_{ii}(S))^2, where U_{ii}(S) is the ith diagonal element of the matrix U(S) = X_S (X_S' X_S)^{-1} X_S'. Try to identify this matrix from the linear model lectures. This formula essentially states that we do not need to build the model n times. Prove the formula.
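Before attempting the proof, the identity can be checked numerically; a small sketch for simple linear regression (the simulated data are only for illustration):

set.seed(1)

n = 30

x = runif(n)

y = 1 + 2*x + rnorm(n, 0, 0.5)

X = cbind(1, x) # design matrix

U = X %*% solve(t(X) %*% X) %*% t(X) # the matrix U(S)

fit = lm(y ~ x)

loocv_direct = sum(sapply(1:n, function(i){ # refit without observation i and predict it

fit_i = lm(y[-i] ~ x[-i])

(y[i] - (coef(fit_i)[1] + coef(fit_i)[2]*x[i]))^2

}))

loocv_short = sum((residuals(fit)/(1 - diag(U)))^2) # shortcut using the diagonal of U(S)

c(loocv_direct, loocv_short) # the two values agree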

Another measure that we discussed is the BIC (Bayesian Information Criterion), defined as BIC(S) = log-likelihood(S) - (|S|/2)*log(n). So it is understood that we will select the model that maximizes the AIC or BIC and minimizes Mallow's Cp.

An interesting fact is that for linear regression, Mallow's Cp and cross-validation often yield essentially the same results so one might as well use Mallow's Cp method in linear regression contexts. However, for more complex problems cross-validation is preferred.

BIC is almost identical to AIC except that it puts a more severe penalty on complexity; thus it tends to choose a smaller model than the other methods. The BIC has a Bayesian interpretation. Let S = {S_1, ..., S_m} be m candidate models out of which we are planning to select the best model. We assign a discrete uniform prior on the set S, that is, P(S_i) = 1/m, i = 1, 2, ..., m. It can be shown that the posterior probability of a model is approximately P(S_j|data) = exp(BIC(S_j))/sum_{k=1}^m exp(BIC(S_k)).
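A tiny numerical sketch of this formula (the BIC values below are hypothetical; subtracting the maximum before exponentiating is only for numerical stability and does not change the ratios):

BIC_vals = c(-52.1, -49.3, -50.0) # hypothetical BIC values for three candidate models (larger is better with the definition above)

post_prob = exp(BIC_vals - max(BIC_vals))

post_prob = post_prob/sum(post_prob)

post_prob # approximate posterior probabilities P(S_j | data)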

We have also discussed the prior probability, Bayes theorem, and the posterior probability distribution. We shed light on the idea that the posterior is proportional to the product of the likelihood and the prior.

We have studied the above sections from the book All of Statistics by Larry Wasserman (Chapter 13). Our next lecture plan is (a) Forward Selection and Backward Selection of Features (b) Zheng-Loh Model Selection Method.

(Homework) All of Statistics: Chapter 13: Problem No. 6, 7 (data related)

(Homework) All of Statistics: Chapter 13: Problem No. 9, 10 (conceptual)


Machine Learning: Lecture - IX (August 10, 2018) (01:30 PM - 03:30 PM)

We started our discussion with two available formulas for Cp. In the previous lecture, we have seen that Mallow's Cp = R_hat_tr(S) + 2|S|sigma_hat^2. Another form of Cp, let us call it Cp', is written as Cp' = R_hat_tr(S)/sigma_hat^2 + 2|S| - n. Both Cp and Cp' are equivalent, as there is a one-to-one correspondence between the two formulas given by Cp = sigma_hat^2(Cp' + n); indeed, sigma_hat^2(Cp' + n) = R_hat_tr(S) + 2|S|sigma_hat^2 - n*sigma_hat^2 + n*sigma_hat^2 = Cp. So the model with the smallest Cp is the same as the model with the smallest Cp'.

After getting an understanding of the different selection criteria, we moved on to selecting a model out of several candidate models (nested!). We studied four different selection procedures. I am not writing them out again here, but the details are in Section 6.1 of ISLR: Best Subset Selection (Algorithm 6.1, Subsection 6.1.1) and Forward Stepwise Selection (Algorithm 6.2, Sections 6.1.2 and 6.1.3). We have also discussed the hybrid approach and the motivation for Adjusted R^2.

(Homework) A short note on the Akaike Information Criterion and the Kullback-Leibler information is uploaded at the following link: Click on A Short Note On AIC. Solve the three problems given in the note.

In the following we compute the K-L information by using the integrate() function in R. Check the assignment in the above short note on AIC. For numerical integration in Python one can use, for example, the quad() function from the scipy.integrate module.

par(mfrow=c(1,1))

alpha = 4

beta = 4

fun_gamma = function(x){

(1/beta^alpha)*exp(-x/beta)*x^(alpha-1)/gamma(alpha)

}

curve(fun_gamma(x), 0, 70, col = 1, ylab = "density", lwd=2, ylim=c(0,0.1))


alpha_w = 2 # note the change in parameter naming. Do not use alpha and beta again

beta_w = 20

fun_Weibull = function(x){

(alpha_w/beta_w)*(x/beta_w)^(alpha_w-1)*exp(-(x/beta_w)^alpha_w)

}

curve(fun_Weibull(x), 0, 70, col = 2, ylab = "density", add = TRUE, lwd=2, lty=2)

legend("topright", legend = c("gamma (4,4)", "Weibull (2,20)"),

bty= "n", col = c(1,2), lwd=c(2,2), lty = c(1,2))

fun_KL = function(x){

fun_gamma(x)*log(fun_gamma(x)/fun_Weibull(x))

}

integrate(fun_KL, 0, 100)

curve(fun_KL(x), 0, 100) # plot the integrand over the same range used for the integration

Machine Learning: Lecture - X (August 20, 2018) (01:30 PM - 03:30 PM)

Today, we have mainly discussed various facts related to AIC and its connection to the likelihood. We have also discussed the historical development of AIC. Most of this is covered in the Short Note on AIC, which was uploaded with the previous day's lecture. We mainly discussed the Kullback-Leibler information and its connection to AIC. In the following, I am giving the R code to compute AIC. We use the definition from the book All of Statistics, so that AIC = log(maximum likelihood) - number of parameters. Let us compute the AIC to choose the optimal order of the polynomial; this is the same problem that we discussed in connection with the estimation of prediction error. At the end, we also compare the performance of AIC and K-fold cross validation. The following codes were mainly developed by my student Dipali Mestry, so all credit goes to her for today's updates. In the following, we simulate the response variable y using the function f(x) = 0.1 + 2*sqrt(x) - sin(x), where x is chosen from the interval (1,5). In the program we consider the linear regression set-up with normally distributed errors, and we look for the model that has the maximum AIC. Ideally the AIC computation should be run on a training sample; however, the complete data set has been used here, so the code is for demonstration purposes only.

Check the behavior of the data carefully. Although we have used a nonlinear function, the behavior of the data is essentially moderately linear.

library(polynom)

par(mfrow=c(1,2))

sigma = 0.3 # population error specification

n = 100

x = seq(1,5,length.out = n) # predictor variable

y = 0.1 + 2*sqrt(x) -sin(x)+rnorm(n,0,sigma) # simulated response

data=data.frame(x,y) # store predictor and response in data.frame

plot(x,y,col="red") # look at the plot

degree = 1:10 # order of the polynomials to be tested

AIC=numeric(length(degree)) # allocation for computed AIC values

for(d in degree){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data =data)

sigma_hat_sq=sum((fit$residuals)^2)/n

fit_poly = polynomial(coef = coef(fit))

f=as.function(fit_poly) # creating function object

curve(f, col = d, add=TRUE) # add the fitted function on the plot

# computation of likelihood for AIC computation.

likelihood=prod((1/(sqrt(2*pi)*sqrt(sigma_hat_sq)))*exp(-(1/(2*(sigma_hat_sq)))*(y-f(x))^2))

s = d + 1 + 1 # coefficients + intercept + sigma

AIC[d]=log(likelihood)-s # As per the definition of All of Statistics

}

AIC

max(AIC)

plot(degree,AIC,col="red",main="using AIC", lwd=2)

ind = which.max(AIC)

points(degree[ind], AIC[ind], pch = "*", lwd=2, cex= 3, col ="blue")

In the following, we revisit the above problem in the light of train-test validation: we check whether train-test validation also selects the same model as chosen by the AIC values. We take the training proportion to be 0.7 and the process is carried out rep = 100 times to compute the average prediction error.

n=100

x=seq(1,5,length.out = n)


sigma = 0.3# population error standard deviation

y=0.1 + 2*sqrt(x) -sin(x)+rnorm(n,0,sigma)


# Lets store them in a data frame

data = data.frame(x,y)


rep = 100

degree = 1:10

train_prop = 0.7


training_mse = matrix(data = NA, nrow = rep, ncol = length(degree))

test_mse = matrix(data = NA, nrow = rep, ncol = length(degree))


training_bias = matrix(data = NA, nrow = rep, ncol = length(degree))

test_bias = matrix(data = NA, nrow = rep, ncol = length(degree))


library(ISLR)

for(i in 1:rep){

train_index = sample(1:length(x), size = train_prop*length(x), replace = FALSE)

train_data = data[train_index, ]

test_data = data[-train_index,]

for(d in degree){

fit = lm(y ~ poly(x, degree = d, raw = TRUE), data = train_data)

fit_poly = polynomial(coef = coef(fit))

fun_poly = as.function(fit_poly)

# computation of training mse

y_hat = predict(fit, newdata = train_data)

training_mse[i, d] = mean((y_hat - train_data$y)^2)

# computation of test mse

y_pred = predict(fit, newdata = test_data)

test_mse[i, d] = mean((y_pred - test_data$y)^2)

}

}


avg_training_mse = colMeans(training_mse)

avg_test_mse = colMeans(test_mse)

y_lim = c(min(avg_training_mse, avg_test_mse), max(avg_training_mse, avg_test_mse))

plot(degree, avg_training_mse, col = "red", lwd=2,

type = "l", ylim = y_lim,main="k-fold")

lines(degree, avg_test_mse, col = "blue", lwd=2)

legend("topleft", c("train MSE", "test MSE"), col = c("red","blue"), lwd=c(2,2), bty="n")

ind = which.min(avg_test_mse)

points(degree[ind], avg_test_mse[ind], col="black", lwd = 3) # mark the degree with the smallest average test MSE

The non-parametric bootstrap method can be used to estimate model selection frequencies (pi_i) and to obtain estimates of precision that include model selection uncertainty. The fundamental idea of the model-based, sampling-theory approach to statistical inference is that the data arise as a sample from some conceptual probability distribution f. The bootstrap method allows the computation of measures of our inference uncertainty by taking a simple empirical estimate of f and sampling from this estimated distribution. We will discuss this more in later lectures. The following contains the algorithm to estimate the model selection frequencies using the bootstrap; this essentially gives an empirical distribution depicting the likelihood of the models over the model space.

B_dataset = matrix(NA,nrow = n,ncol = 2)

degree = 1:10

AIC = numeric(length(degree))

B = 500 # number of bootstrap resamples (B was not set in the original code; choose as appropriate)

max_AIC_model_no = numeric(B)

for(j in 1:B){

index = sample(1:length(x), size = n, replace = TRUE)

B_dataset = data[index,]

new_data = data.frame(B_dataset[,1],B_dataset[,2])

for(d in degree){

fit = lm(new_data[,2] ~ poly(new_data[,1], degree = d, raw = TRUE), data =new_data)

sigma_hat_sq=sum((fit$residuals)^2)/n

fit_poly = polynomial(coef = coef(fit))

f=as.function(fit_poly)

likelihood=prod((1/(sqrt(2*pi)*sqrt(sigma_hat_sq)))*exp(-(1/(2*(sigma_hat_sq)))*(new_data[,2]-f(new_data[,1]))^2))

s = d + 1 + 1 # coefficients + intercept + sigma, as in the AIC computation above

AIC[d]=log(likelihood)-s

}

AIC

t = which.max(AIC)

max_AIC_model_no[j]=t

}

prob_model_selection=numeric(length(degree))

for(i in degree){

prob_model_selection[i] = sum(max_AIC_model_no == i)/B # relative frequency with which degree i has the largest AIC

}

prob_model_selection

which.max(prob_model_selection)

barplot(prob_model_selection,names.arg = as.character(degree))

As we can see, the polynomial of degree three has the highest selection frequency. This is in agreement with the models selected by AIC and by train-test validation.

Machine Learning: Lecture - XI (August 24, 2018) (01:30 PM - 03:30 PM)

Machine Learning: Lecture - XII (August 27, 2018) (01:30 PM - 03:30 PM)

The above two days lectures are updated in the following material. The material also includes tutorial problems. Click the link: Introduction to Bayes Estimation

Mid Semester Examination (September 10, 2018): Question Paper


Machine Learning: Lecture - XVI (August 27, 2018) (01:30 PM - 03:30 PM)

Machine Learning: Lecture - XVII (October 5, 2018) (01:30 PM - 03:30 PM)

We have continued the discussion of the univariate delta method and moved on to the multivariate delta method. We considered the example of the ratio estimator to demonstrate the use of the delta method in the two-variable case. The univariate delta method was demonstrated using the following R code. A pdf containing the details of the delta method will be uploaded. We mainly discussed the topics from Page 240, Section 5.5.4, Casella and Berger (2002) and Section 9.9 from Chapter 9, Wasserman (2005).

Demonstration of Delta Method:

n = 1000 # Size of the sample, consider varying it and investigating the histogram

lambda = 2

x = rpois(n, lambda = lambda)

hist(x, probability = TRUE)


rep = 10000 # Required to visualize the distribution

sample_mean = numeric(rep)

vals = matrix(data = NA, nrow = rep, ncol= n)

for(i in 1:rep){

vals[i,] = rpois(n, lambda = lambda)

}

sample_mean = rowMeans(vals)

hist(sample_mean, probability = TRUE, col = "lightgrey", breaks = 70)

hist(sqrt(n)*(sample_mean-lambda),

probability = TRUE, col = "grey",main = bquote(sqrt(n)(bar(X)-lambda)))

#Validity of the central limit theorem

curve(dnorm(x, mean = 0, sd = sqrt(lambda)),

add = TRUE, col = "red", lwd=2)



# Delta method

hist(1/sample_mean, probability = TRUE, main = bquote(1/bar(X)), col = "grey")

points(1/lambda, 0, pch = 19, lwd=2, cex=2,col= "red")

mean(1/sample_mean)


hist(sqrt(n)*(1/sample_mean-1/lambda), probability = TRUE,

col = "grey",main = bquote(sqrt(n)(1/bar(X) - 1/lambda)))

curve(dnorm(x, mean = 0, sd = sqrt(1/lambda^3)), col = "red", lwd=2, add = TRUE)


mean(1/sample_mean^3)

1/lambda^3

Machine Learning: Lecture - XVIII (October 8, 2018) (01:30 PM - 03:30 PM)

L_1 and L_2 regularization. Ridge and Lasso regression. Connection to the Bayesian approach.
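The lecture material itself is not reproduced here; as a pointer, a minimal ridge/lasso sketch using the glmnet package (glmnet is one convenient tool, not necessarily the one used in class; the simulated data are illustrative):

library(glmnet)

set.seed(1)

n = 100

p = 10

X = matrix(rnorm(n*p), n, p)

beta = c(3, -2, rep(0, p - 2)) # sparse true coefficient vector

y = as.numeric(X %*% beta + rnorm(n))

ridge_cv = cv.glmnet(X, y, alpha = 0) # alpha = 0: ridge (L_2 penalty), lambda chosen by cross validation

lasso_cv = cv.glmnet(X, y, alpha = 1) # alpha = 1: lasso (L_1 penalty)

coef(ridge_cv, s = "lambda.min") # ridge shrinks the coefficients but keeps all of them

coef(lasso_cv, s = "lambda.min") # lasso sets many coefficients exactly to zero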


Machine Learning: Lecture - XIX (October 12, 2018) (01:30 PM - 03:30 PM)

Machine Learning: Lecture - XX (October 15, 2018) (01:30 PM - 03:30 PM)

The following topics are covered in the above two lectures

  • Introduction to Multivariate Statistics: Mean vector, Variance Covariance Matrix, Standard deviation matrix, Correlation matrix. Ref: Chapter 3, Applied Multivariate Statistical Analysis by Johnson and Wichern.

  • Principal Component Analysis (PCA): Concepts of Population Principal Components, Maximization of the quadratic forms subject to constraints, Spectral decomposition, Computation of Principal components, Related problems, Geometry of the principal directions. Ref: Chapter 8, Applied Multivariate Statistical Analysis by Johnson and Wichern. Some notes related to matrix algebra in connection with PCA are attached here.

  • Principal Component Regression (PCR): Ref: Chapter 6, ISLR, Section: 6.3: Dimension Reduction Methods, Section 6.3.1: PCR.

The number of components in PCR can be decided by using cross validation. Useful R functions are pcr and validationplot from the pls library.
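A minimal sketch with the pls package (the simulated data are only for illustration; with validation = "CV", pcr() performs cross validation over the number of components):

library(pls)

set.seed(1)

n = 100

X = matrix(rnorm(n*5), n, 5)

y = as.numeric(X %*% c(1, 0.5, 0, 0, 0) + rnorm(n, 0, 0.5))

dat = data.frame(y = y, X)

fit_pcr = pcr(y ~ ., data = dat, scale = TRUE, validation = "CV") # principal component regression with CV

summary(fit_pcr)

validationplot(fit_pcr, val.type = "MSEP") # choose the number of components with the smallest CV error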

Machine Learning: Lecture - XXI (October 19, 2018) (01:30 PM - 03:30 PM)

Factor Analysis: Chapter 9, Applied Multivariate Statistical Analysis by Johnson and Wichern.

Estimation of parameters in Factor Model: Principal Component Factor Analysis

Multivariate Normal Distribution: Chapter 4, Applied Multivariate Statistical Analysis by Johnson and Wichern.

Machine Learning: Lecture - XXII (October 22, 2018) (01:30 PM - 03:30 PM)

Chapter 7. Moving Beyond Linearity: Polynomial Regression, Piecewise Constant Regression Models, Basis Function Approach, Piecewise Polynomials, Spline Basis Representation, Constraints and Splines, Introduction to Generalized Additive Models


Machine Learning: Lecture - XXIII (October 25, 2018) (01:30 PM - 03:30 PM)

Chapter 7: ISLR - Spline basis representation for fitting cubic splines, how to choose the number of knots and the location of knots, the idea of smoothing splines and the choice of the tuning parameter, the idea of local regression and its computation, and some further discussion on generalized linear models and smoothing splines.


Machine Learning: Lecture - XXIV (October 26, 2018) (01:30 PM - 03:30 PM)

Verification of the fact that the spline basis representation is indeed a cubic spline. Fitting Generalized Additive Models. Backfitting algorithm. Example case for linear regression with two predictors. Introduction to orthogonal regression using the example of simple linear regression. Notion of inner products, projection and Gram Schmidt orthogonalization.

The following is an example of the backfitting algorithm for multiple linear regression:

n = 100

x1 = runif(n, min = 0, max = 1)

x2 = runif(n, min = 0, max = 1)


beta_0 = 0.7

beta_1 = 0.5

beta_2 = 0.3

sigma = 0.2


y = beta_0 + beta_1*x1 + beta_2*x2 + rnorm(n, mean =0, sd= sigma)


fit = lm(y ~ x1 + x2)

coef(fit)


maxiter = 10

est_beta_0 = rep(NA, maxiter)

est_beta_1 = rep(NA, maxiter + 1) # one extra slot: beta_1 is updated at index i + 1 inside the loop

est_beta_2 = rep(NA, maxiter)


est_beta_1[1] = 40 # initialization of beta_1

for(i in 1:maxiter){

a = y - est_beta_1[i]*x1

est_beta_0[i] = lm(a~x2)$coef[1] # coefficient update on beta_0

est_beta_2[i] = lm(a~x2)$coef[2] # coefficient update on beta_2

a = y- est_beta_2[i]*x2

est_beta_1[i+1] = lm(a~x1)$coef[2] # coefficient update on beta_1

}

plot(1:maxiter, est_beta_0, col = "red", type = "l",

ylim=c(0,1), ylab = "Coefficient estimates")

lines(est_beta_1, col = "blue", lwd=1)

lines(est_beta_2, col = "green", lwd=1)

legend(8, 1, c(expression(beta[0]), expression(beta[1]), expression(beta[2])), lwd=c(1,1,1), col = c("red", "blue", "green"), bty = "n")

Machine Learning: Lecture - XXV (October 27, 2018) (01:30 PM - 03:30 PM)

Today we discussed regression with categorical predictors: the creation of dummy variables and the interpretation of the parameters under different dummy-variable encodings. Everything was demonstrated in R. We also discussed the application of ANOVA in comparing nested models.

x = c(rep("M", 10), rep("F", 10))

y = c(runif(10, 100, 200), runif(10, 150, 210))

data = data.frame(y, x)

dummy_x = rep(NA, length(x))


for(i in 1:length(x)){

if(x[i] == "M"){

dummy_x[i] = 1

}

else{

dummy_x[i] = 0

}

}

dummy_x = as.factor(dummy_x)

newdata = data.frame(y, dummy_x)

View(newdata)

fit = lm(y ~ dummy_x, data = newdata)

summary(fit)


fit1 = lm(y ~ x, data = data)

coef(fit1)


# new encoding of dummy variables (effect coding: F = -1, M = +1)

dummy_x = rep(NA, length(x)) # reset to a plain numeric vector; assigning -1 to the earlier 0/1 factor would produce NAs

for(i in 1:length(x)){

if(x[i] == "F"){

dummy_x[i] = -1

} else {

dummy_x[i] = 1

}

}

# keep dummy_x numeric: with balanced groups, the intercept under -1/+1 coding equals the grand mean of y

newdata = data.frame(y, dummy_x)

tail(newdata)

fit2 = lm(y ~ dummy_x, data = newdata)

coef(fit2)

mean(y)

# So what we have learnt is that the interpretation of the parameters depends on the choice of the dummy variable encoding

# One has to be cautious while dealing with real data


x = c(rep("M", 10), rep("F", 10), rep('T', 10))

y = c(runif(10, 100, 200), runif(10, 150, 210), runif(10, 90,150))

data = data.frame(y, x)

summary(data)

fit3 = lm(y ~ x, data = data)

summary(fit3)


# This is a homework for the encoding as described on the board


# Another example of model selection

# There are different ways to choose the optimal degree of the polynomial

n = 100

x = runif(n = n , 0,1)

beta_0 = 0.4

beta_1 = 0.3

beta_2 = 0.7

sigma = 0.1

y = beta_0 + beta_1*x + beta_2*x^2 + rnorm(n, 0, sigma)

plot(x, y)


data = data.frame(y, x)

fit_1 = lm(y ~ 1, data = data)

summary(fit_1)

range_x = range(x)

test.x = seq(range_x[1], range_x[2], length.out = 100)

pred_y = predict(fit_1, newdata = as.data.frame(test.x))

lines(test.x, pred_y, col = "red", lwd=2)


fit_2 = lm(y ~ x, data = data)

summary(fit_2)

test.x = seq(range_x[1], range_x[2], length.out = 100)

pred_y = coef(fit_2)[1] + coef(fit_2)[2]*test.x

lines(test.x, pred_y, col = "blue", lwd=2)


fit_3 = lm(y ~ x + I(x^2), data = data)

summary(fit_3)

test.x = seq(range_x[1], range_x[2], length.out = 100)

pred_y = coef(fit_3)[1] + coef(fit_3)[2]*test.x + coef(fit_3)[3]*test.x^2

lines(test.x, pred_y, col = "green", lwd=2)


anova(fit_1, fit_2)

anova(fit_2, fit_3)


fit_4 = lm(y ~ x + I(x^2) + I(x^3), data = data)

summary(fit_4)

test.x = seq(range_x[1], range_x[2], length.out = 100)

pred_y = coef(fit_4)[1] + coef(fit_4)[2]*test.x + coef(fit_4)[3]*test.x^2 + coef(fit_4)[4]*test.x^3

lines(test.x, pred_y, col = "magenta", lwd=2)


anova(fit_3, fit_4)


anova(fit_1, fit_2, fit_3, fit_4)

Machine Learning: Lecture - XXVI (October 27, 2018) (01:30 PM - 03:30 PM)

Logistic regression, interpretation of the parameters, estimation of the parameters by maximum likelihood, regularization in logistic model fitting, Bayes theorem in a classification setup.
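A minimal sketch of logistic regression on simulated data using glm with family = binomial: the coefficient of x is on the log-odds scale, so exp(coef) is the multiplicative change in the odds per unit increase in x. (The regularized version can be fit with cv.glmnet(..., family = "binomial") if the glmnet package is available.)

set.seed(6)
n = 500
x = rnorm(n)
p = 1/(1 + exp(-(-0.5 + 1.2*x)))  # true success probability
y = rbinom(n, size = 1, prob = p)

fit = glm(y ~ x, family = binomial)
summary(fit)
exp(coef(fit))                    # odds ratios

predict(fit, newdata = data.frame(x = 1), type = "response") # estimated P(Y = 1 | x = 1)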


Machine Learning: Lecture - XXVII (October 29, 2018) (01:30 PM - 03:30 PM)

A Session on Machine Learning using MATLAB


Machine Learning: Lecture - XXVIII (October 29, 2018) (03:30 PM - 05:30 PM)

Regression Tree and Classification Tree, Cost-Complexity pruning, Bagging, Boosting, Random forest, Interpretation of Regression Tree (ISLR)
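A minimal sketch of a regression tree with cost-complexity pruning and a random forest, on simulated data; the rpart and randomForest packages used here are one convenient choice (ISLR itself uses the tree package), so treat the code as illustrative.

library(rpart)
library(randomForest)

set.seed(7)
n = 300
x1 = runif(n); x2 = runif(n)
y = 2*(x1 > 0.5) + x2 + rnorm(n, sd = 0.3)
data = data.frame(y, x1, x2)

# Single regression tree and cost-complexity pruning via the cross-validated error
fit_tree = rpart(y ~ x1 + x2, data = data)
printcp(fit_tree)
best_cp = fit_tree$cptable[which.min(fit_tree$cptable[, "xerror"]), "CP"]
pruned = prune(fit_tree, cp = best_cp)
plot(pruned); text(pruned)

# Random forest: bagged trees with a random subset of predictors at each split
fit_rf = randomForest(y ~ x1 + x2, data = data, ntree = 500, importance = TRUE)
importance(fit_rf) # variable importance measures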

Machine Learning: Lecture - XXIX (October 30, 2018) (01:30 PM - 03:30 PM)

Projection Pursuit Regression (PPR), fitting procedure, quasi-Newton approximation, architecture of a neural network, understanding the optimization process and its effectiveness in approximating highly nonlinear functions. This section will be further updated soon. (Ref: Elements of Statistical Learning, Chapter 11.)
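A minimal sketch of projection pursuit regression using ppr from base R (package stats) on simulated data: the response is modelled as a sum of ridge functions of linear projections of the predictors, with nterms controlling how many projections are kept.

set.seed(8)
n = 400
x1 = runif(n, -1, 1); x2 = runif(n, -1, 1)
y = sin(3*(0.7*x1 + 0.3*x2)) + (x1 - x2)^2 + rnorm(n, sd = 0.2)
data = data.frame(y, x1, x2)

fit_ppr = ppr(y ~ x1 + x2, data = data, nterms = 2, max.terms = 5)
summary(fit_ppr)        # estimated projection directions and term importance
par(mfrow = c(1, 2))
plot(fit_ppr)           # fitted ridge functions, one panel per term
par(mfrow = c(1, 1))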