Statistics Lab

The following function allows you to visualize the distribution of the sample mean from various distributions. Here n = the size of each sample and m = the number of samples drawn. The central limit theorem says that (under certain conditions), when the sample size is large, the distribution of the sample mean behaves like a normal distribution, whatever the underlying population distribution from which the random samples are drawn. - Modify the following code to accept user inputs for the parameters of the distribution. Try to understand the use of the sapply function. - Plot the normal density on the histogram of the sample mean (a sketch of this overlay is given after the example calls below).

DemoCLT = function(n=50, m=40, distribution="normal"){

    if(distribution=="normal")

        data = matrix(rnorm(n*m, mean=10,sd=2), nrow=n)

    if(distribution == "uniform")

        data = matrix(runif(n*m, min=0, max=3), nrow = n)

    if(distribution == "beta")

        data = matrix(rbeta(n*m, shape1 = 1, shape2=2), nrow=n)

    if(distribution == "exponential")

        data = matrix(rexp(n*m, rate = 1), nrow=n)

    if(distribution == "binomial")

        data = matrix(rbinom(n*m, size = 20, prob = 0.5), nrow=n)

    colnames(data) = paste("sample", as.character(1:m))

    SampleMeans = sapply(colnames(data), function(x) mean(data[,x]))

    par(mfrow = c(1,2))

    hist(SampleMeans, prob=T, col="lightgrey", main = "Distribution of Sample Mean")

    qqnorm(SampleMeans, col="red", lwd=2)

    qqline(SampleMeans, col ="blue", lwd=2)

    mtext(paste("Distribution of Mean for n = ", as.character(n)), outer = TRUE, cex = 1.5)

    result = shapiro.test(SampleMeans)

    print(result)

}

DemoCLT(n=1, m=400, distribution = "uniform")

DemoCLT(n=5, m=400, distribution ="uniform")
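To add the normal density overlay asked for above, one minimal sketch (assuming the two lines are placed inside DemoCLT, immediately after the hist() call, where SampleMeans is available) is:

curve(dnorm(x, mean = mean(SampleMeans), sd = sd(SampleMeans)),
      col = "blue", lwd = 2, add = TRUE)   # normal density using the empirical mean and sd of the sample means; by the CLT the sd is approximately the population sd divided by sqrt(n)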

#Note the use of the complex class in R. Various operations on complex numbers are also available. help(complex)

Exercise-4. In a statistical application of your choice, what does a missing value mean? What are the traditional methods of imputing missing values in such an application?

Exercise-5. Create vectors of each of the different primitive types. Create matrices and arrays by attaching dim attributes to those vectors. Look up the help for dimnames and attach dimnames to a matrix with two rows and five columns.
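One possible sketch for Exercise-5 is given below; the object names are arbitrary.

num = c(1.5, 2, 3, 4.5, 6, 7, 8, 9, 10, 11)    # double
int = 1:10                                     # integer
lgl = rep(c(TRUE, FALSE), 5)                   # logical
chr = letters[1:10]                            # character
cpx = complex(real = 1:10, imaginary = 0)      # complex
dim(num) = c(2, 5)                             # attaching a dim attribute turns num into a 2 x 5 matrix
dim(int) = c(2, 5)
dimnames(num) = list(c("r1", "r2"), paste("c", 1:5, sep = ""))   # see help(dimnames)
num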

Exercise-6. Write an R function that takes a matrix as input and standardizes each row by subtracting the median and dividing by the MAD (Median Absolute Deviation). You might want to look at the functions mad and median. Next, generalize your function so that these are the default values, but allow the user to select other measures of the center and spread, should they want to use them.
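One possible sketch for Exercise-6 (the function and argument names are only illustrative):

standardize = function(mat, center = median, spread = mad){
  # standardize each row: subtract the chosen measure of center and divide by the chosen measure of spread
  t(apply(mat, 1, function(row) (row - center(row))/spread(row)))
}
m = matrix(rnorm(20), nrow = 4)
standardize(m)                                # default: median and mad
standardize(m, center = mean, spread = sd)    # user-selected measures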

Exercise-7. Write a function that takes $x$ and $n$ as input and returns the value of $$1+\frac{x}{1!} + \frac{x^2}{2!}+\cdots+ \frac{x^n}{n!}.$$ Note that this gives an approximation of the function $e^x$. Now suppose we want an estimate of the error made in approximating the exponential function by this series over some interval $[a,b]$. Write a program that takes user input for $a$, $b$, $x$ and $n$ and estimates the maximum error over the interval. Can you find a suitable $n$ for which the error is less than some tolerance limit?
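A possible sketch for Exercise-7 (function names are illustrative; here a and b are passed as arguments, but scan() could be used for interactive input):

expSeries = function(x, n){
  sum(x^(0:n)/factorial(0:n))            # 1 + x/1! + x^2/2! + ... + x^n/n!
}
maxError = function(a, b, n){
  grid = seq(a, b, length.out = 1000)    # fine grid on [a, b]
  max(abs(exp(grid) - sapply(grid, expSeries, n = n)))
}
maxError(a = 0, b = 2, n = 5)
n = 1
while(maxError(0, 2, n) > 1e-6) n = n + 1   # smallest n with maximum error below the tolerance
n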

Some useful functions in R which are very helpful during data analysis:

is.factor(), is.logical(), as.data.frame(), as.numeric() etc. 

R can deal with complex numbers also 

z = complex(real = 2, imaginary = 3); z

Re(z); Im(z); Mod(z); Arg(z); Conj(z)

Resampling Techniques.

Basic concepts and terminologies

Suppose that we observe a quantitative response Y and p different predictors,  X1,X2, · · · ,Xp. We assume that there is some relationship between Y and X = (X1,X2,  · · ·  ,Xp),  which can be written in the very general form 

Y = f(X) + e 

Here f is some fixed but unknown function of X1, X2, · · · , Xp, and e is a random error term which is independent of X and has mean zero. In this formulation, f represents the systematic information that X provides about Y. Statistical learning deals with several approaches to estimating the function f. The reasons for estimating f are twofold, namely "prediction" and "inference". In many real life problems the information about the input variables is readily available; however, the output Y is difficult to obtain. In such cases, by estimating f we can predict Y using

Y_hat = f_hat(X)

where f_hat represents the estimate of f and Y_hat represents the resulting prediction of Y. The above equation gives a reasonable prediction since e averages to zero. You are requested to run the following R code and understand the output (refer to the class notes). These pieces will be used again and again in the programs developed for the different resampling techniques.

set.seed(12345)                         # Ensures the reproducibility of the simulations

x=runif(n=100, min=0, max=1)            # Generate 100 numbers from U(0,1) distribution

y=2+3*x^2+rnorm(n=100, mean=0, sd=0.5)  # Simulation of response variable, see help(rnorm)

y                                       # Print the response y

plot(x,y)                               # Scatterplot of x and y

data=data.frame(x,y)                    # Store data in a dataframe with name "data"

print(data)                             # Prints the contents of the data

fit=lm(y~x, data=data)                  # Fit a linear regression y=b_0+b_1x

summary(fit)                            # Get summary of the fitting exercises, MOST IMPORTANT

abline(fit, col=29, lwd=2)              # Add the estimated regression line to the scatterplot

fit=lm(y~x+I(x^2)+I(x^3),data=data)     # Fitting a cubic regression equation, carefully check I(x^2)

str(fit)                                # Useful command to get structure of the class

library(ISLR)                           # Load the package ISLR (install it first with install.packages if needed)

fit=lm(y~poly(x,10),data=data)          # poly() fits polynomial regression of the given order; no need to write I() every time

plot(x,y)                                 

fit=lm(y~x+I(x^2),data=data)

x1=seq(from = 0, to=1, by=0.1)

y1=coef(fit)[1]+coef(fit)[2]*x1+coef(fit)[3]*x1^2

lines(x1,y1, col=3, lwd=2)              # Plot the estimated polynomial regression equation

help(curve)                             # Explore curve function for quick plots

 

Validation Set Approach

The validation set approach is a simple resampling method for selecting a suitable statistical learning method. In this method, the available set of observations is randomly divided into two parts: one subset is called the training set and the other is called the validation set or hold-out set. The model is fitted to the training data and the estimated model is used to predict the responses in the validation set. For a quantitative response the validation set error rate is assessed using the MSE; this validation set error rate is nothing but an estimate of the test error rate. Below we demonstrate the method by a simulation study using R.

library(ISLR)

set.seed(12345)

x = runif(n=100, min = 0, max = 1)

y = 2 + 3 * x^2 + rnorm(n=100, mean = 0, sd = 0.5)

data = data.frame(x,y)

par(mfrow = c(1,2))

deg = 10

run = 10

tmp.train.mse = matrix(NA, nrow = run, ncol = deg)

validation.mse = matrix(NA, nrow = run, ncol = deg)

for (i in 1:run) {

  train = sample(1:nrow(data), size = floor(0.5*nrow(data)), replace = FALSE)

  train_ = data[train,]

  validation_ = data[-train,]

  for (j in 1:deg) {

    fit = lm(y ~ poly(x, j), data = train_)  

    tmp.train.mse[i,j] = sum(fit$residuals^2)/nrow(train_)

    test.y = predict.lm(fit, newdata = validation_, interval = "prediction")

    validation.mse[i,j] = mean((test.y[,1] - validation_$y)^2)

  }

  if(i == 1)

    plot(1:deg, validation.mse[i,], col = i, type = "l", lwd=2, ylim = c(0, 0.6),

         xlab="Degree of Polynomial", ylab = "Mean Squared Error")

  else

    lines(1:deg, validation.mse[i,], col = i, lwd=2)

}

plot(1:deg, colMeans(validation.mse), type = "b", col = "red", lwd=2, 

     xlab="Degree of Polynomial", ylab = "Mean Squared Error") 

The validation set approach was used on the simulated data set in order to estimate the test error that results from predicting Y using a polynomial function of X. (a) The validation method was repeated ten times, each time using a different random split of the observations into a training set and a validation set; the first plot shows the ten resulting test MSE curves and illustrates the variability in the estimated test MSE that results from this approach. (b) The second plot shows the validation error estimates averaged over the ten splits. We have generated 100 random numbers uniformly distributed on (0,1) and considered them as the observations on the input variable X. The observations on the response variable Y are then generated from the equation Y = 2 + 3X^2 + e, also called the population regression line, where e is simulated from a normal distribution with mean zero and standard deviation 0.5. The data are randomly split into two equal parts, each containing 50 observations. The validation set error rates are obtained by fitting regression models of various degrees on the training sample and evaluating their performance on the validation sample, using the MSE as the measure of validation set error. The program above fits polynomials of different degrees and selects the best-fitting polynomial based on the average validation set MSE.

Leave One Out Cross Validation

Suppose that we have the dataset (x_i, y_i), i = 1, · · · , n. In this method, the set of observations is divided into two parts: the test set contains only one observation (say (x_1, y_1)) and the training set contains the data (x_i, y_i), i = 2, · · · , n. The model is fitted on the training set and the test MSE is computed by predicting the response y_1 in the test set; call it MSE_1. The process is continued for (x_2, y_2), . . . , (x_n, y_n) and MSE_2, . . . , MSE_n are computed respectively. The average test MSE is computed as

MSE =(MSE_1+MSE_2+ · · · +MSE_n)/n. 

In the following R codes we demonstrate the method on the simulated data. The technique can be implemented for various statistical learning methods and the method with the smallest LOOCV test error is selected as the best choice.

par(mfrow=c(1,1), mar= c(5, 4, 4, 2) + 0.1)

deg = 10

run = 10

tmp.train.mse = rep(NA, deg)

validation.mse = rep(NA, deg)

tmp.mse = rep(NA, nrow(data))

for (j in 1:deg) {

  for (k in 1:nrow(data)) {

    train_ = data[-k,]

    validation_ = data[k,]

    fit = lm(y ~ poly(x, j), data = train_)  

    test.y = predict.lm(fit, newdata = validation_, interval = "prediction")

    tmp.mse[k] = (test.y[,1] - validation_$y)^2

  }  

  validation.mse[j] = mean(tmp.mse)

}

plot(1:deg, validation.mse, type = "b", col = "red", lwd=2, 

     xlab="Degree of Polynomial", ylab = "Mean Squared Error")
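The same LOOCV computation can also be done with the cv.glm() function from the boot package; a minimal sketch, assuming the data frame data and deg defined above (a glm() fit with the default gaussian family is equivalent to lm()):

library(boot)
loocv.mse = rep(NA, deg)
for (j in 1:deg) {
  glm.fit = glm(y ~ poly(x, j), data = data)     # gaussian family: same fit as lm()
  loocv.mse[j] = cv.glm(data, glm.fit)$delta[1]  # LOOCV is the default (K = n)
}
loocv.mse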

K-fold cross validation

The sample is divided into k non-overlapping sets of (roughly) equal size. For each of the k sets, the remaining k-1 folds are used for training and the held-out fold for testing. The mean squared error, MSE_1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as the validation set. This process results in k estimates of the test error, MSE_1, MSE_2, . . . , MSE_k. The k-fold CV estimate of the test MSE is computed by averaging these values

MSE =(MSE_1+MSE_2+ · · · +MSE_k)/k

set.seed(1)

x = rnorm(100)

y = x -2 *x^2+ rnorm(100)

data = data.frame(x,y)

deg = 10; run = 10

mse=rep(0,deg)

for(i in 1:run){

  fold = sample(rep(1:5,20))

  for(j in 1:deg){

    yhat = rep(0,100)

    for (k in 1:5){

      train = (1:100)[fold!=k]

      lm.fit = lm(y~poly(x,j),data = data, subset=train)

      yhat[-train] = predict(lm.fit, data)[-train]   # predict on the held-out fold

    }

    mse[j] = mean((yhat-y)^2)

  }

  if(i == 1)

    plot(1:deg, mse, type = "l", col = i, lwd=2, 

         xlab="Degree of Polynomial", ylab = "Mean Squared Error")

  else

    lines(1:deg, mse, col=i, lwd=2)

}
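The K argument of cv.glm() from the boot package gives the k-fold CV estimate directly; a minimal sketch using the simulated data above:

library(boot)
cv.error = rep(NA, deg)
for (j in 1:deg) {
  glm.fit = glm(y ~ poly(x, j), data = data)
  cv.error[j] = cv.glm(data, glm.fit, K = 5)$delta[1]   # 5-fold CV estimate of the test MSE
}
plot(1:deg, cv.error, type = "b", col = "blue", lwd = 2,
     xlab = "Degree of Polynomial", ylab = "5-fold CV Error")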

Exercise - 8. Model flexibility and predictive ability. You are asked to understand the following code and interpret the output generated by it.

library(ISLR)

set.seed(runif(1, 1, 100))

x = runif(n=30, min = 0, max = 1)

y = 1 + 2 * x^3 + rnorm(n=30, mean = 0, sd = 0.5)

data = data.frame(x,y)

deg = 9

run = 100

train.mse = rep(NA, deg)

test.mse = rep(NA, deg)

tmp.train.mse = rep(NA, run)

tmp.test.mse = rep(NA, run)

par(mfrow=c(3,3), mai=c(0.4,0.4,0.4,0.4))

for(i in 1:deg){

  plot(x, y, col="red", main="", lwd=2)

  for(j in 1:run){

    train = sample(1:nrow(data), size = floor(0.7*nrow(data)))

    train_ = data[train,]

    test_ = data[-train,]

    fit = lm(y ~ poly(x, i), data = train_)  

    tmp.train.mse[j] = sum(fit$residuals^2)/nrow(train_)

    test.y = predict.lm(fit, newdata = test_, interval = "prediction")

    tmp.test.mse[j] = mean((test.y[,1] - test_$y)^2)

  }

  train.mse[i] = mean(tmp.train.mse)

  test.mse[i] = mean(tmp.test.mse)

  train.mse[i]

  test.mse[i]

  fit.all = lm(y ~ poly(x,i), data = data)

  tmp = seq(from = 0, to = 1, by = .01)

  tmp.fit = predict(fit.all, newdata = data.frame(x=tmp), interval = "prediction")

  lines(tmp, tmp.fit[,1], lwd= 2)

  text(.2, max(y)-.5,  paste("deg =", i), bty="n", lwd=2, cex=1.4)

}

par(mfrow=c(1,1), mar= c(5, 4, 4, 2) + 0.1)

ylim = max(train.mse,test.mse)

plot(1:deg, train.mse, type="o", col="red", cex.lab=1.5,

     xlab= "Degree of polynomial", ylim=c(0,ylim), lwd=2, ylab="mse")

lines(1:deg, test.mse, type = "o", col = "green", lwd=2)

legend("topleft", col=c("red","green"), cex=1.3,legend=c("train mse", "test mse"), bty="n", lwd=2)

Exercise-16.  Read about the packages alr3 and alr4. This will be very helpful while discussing nonlinear regression. 

Exercise-17. Solve the exponential, logistic, theta-logistic, Gompertz and Allee models analytically and sketch the solution curves in R without using the function ode from the deSolve package.

Nonlinear Regression

We have already discussed some aspects of linear regression using R, particularly the use of the function lm. This function is useful for linear models, or for nonlinear models which can be converted to a linear model. However, there are many models that cannot be written as a linear function of the parameters. In such cases, nonlinear regression modelling is required. In R the nls function takes care of the estimation of the parameters of a nonlinear function. Since we have already learned about the different growth curve models, I shall use these models as the nonlinear functions whose parameters are to be estimated from given data. To do this, we shall write functions for all the models. I am listing the codes, which were mainly written by the M.Sc. students. The idea is that, for each growth model X_t = f(t), we shall simulate data at different time points. Then we fit the model to the simulated data and obtain the parameter estimates using the nls function. We shall compare the true curve and the estimated curve by means of graphs. Note that we have already solved the growth equations in Exercise-17 to obtain X_t as a function of t. Run the following codes and the idea will be clear.

Exponential Model

x0 = 5; r=0.6; 

t = seq(0, 5, by = 0.1)

x = x0*exp(r*t) + rnorm(length(t), 0, 9)

data = data.frame(t, x)

plot(t, x, type = "p", xlab = "time", ylab = "Size")

legend('topleft', legend = "Exponential Model", bty="n")

curve(x0*exp(r*x), col = 4, lwd= 2, add = TRUE)

ExpModel = function(t, x0, r){

  x0*exp(r*t)

}

fit = nls(x ~ ExpModel(t, x0, r), data = data, start = list(x0 = 6, r = 0.7))

summary(fit)

lines(t, fitted.values(fit), col = 2, lwd = 2)

legend("bottomright", legend = c("Estimated", "True"), col = c(2, 4), lwd = c(2,2), bty = "n")

Logistic Model

LogisticModel = function(t, x0, r, K){

  K/(1+(K/x0 - 1)*exp(-r*t))

}

x0 = 5; r = 0.8; K = 20; 

t = seq(0, 15, by = 0.3)

x = LogisticModel(t, x0 = 5, r = 0.8, K = 20) + rnorm(length(t), 0, 2)

data = data.frame(t, x)

plot(t, x, type = "p", xlab = "time", ylab = "Size")

legend('topleft', legend="Logistic Model", bty="n")

curve(K/(1+(K/x0 - 1)*exp(-r*x)), lwd = 2, col = 2, add = T)

fit = nls(x ~ LogisticModel(t, x0, r, K), data = data, start = list(x0 = 6, r = 1, K = 20) )

summary(fit)

lines(t, fitted.values(fit), col = 3, lwd = 3)

legend("bottomright", legend = c("Estimated", "True"), col = c(3, 2), lwd = c(2,2), bty = "n")

library(car)                                  # deltaMethod() is provided by the car package
deltaMethod(fit,"1/r*log(K/x0-1)")            # Estimate and standard error of a function of the parameters via the delta method

theta-Logistic Model

ThetaLogisticModel = function(t, r, x0, K, theta){

  K/(1+exp(-theta*r*t)*((K/x0)^(theta)-1))^(1/theta)

}

x0 = 5; r = 0.4; K = 20; theta = 0.7; 

t = seq(0, 15, by = 0.3)

x = ThetaLogisticModel(t, r = 0.4, x0 = 5, K = 20, theta = 0.7) + rnorm(length(t), 0, 1)

data = data.frame(t, x)

plot(t, x, type = "p", xlab = "time", ylab = "Size")

legend('topleft', legend="theta-Logistic Model", bty="n")

curve(ThetaLogisticModel(x, r, x0, K, theta ), lwd = 2, col = 3, add = TRUE)

fit = nls(x ~ ThetaLogisticModel(t, r, x0, K, theta), data = data, 

          start = list(x0 = 5, r = 0.4, K = 20, theta = 0.7))

lines(t, fitted.values(fit), col = 2, lwd = 2)

legend("bottomright", legend = c("Estimated", "True"), col = c(2, 3), lwd = c(2,2), bty = "n")

Gompertz Model

GompertzModel = function(t, x0, r, alpha){

  x0*exp(r*(1-exp(-alpha*t))/alpha)

}

x0 = 5; alpha = 0.4; r = 0.5

t = seq(0, 15, by = 0.3)

x = GompertzModel(t, r = 0.5, x0 = 5, alpha = 0.4) + rnorm(length(t), 0, 1)

data = data.frame(t, x)

plot(t, x, type = "p", xlab = "time", ylab = "Size")

legend('topleft', legend = "Gompertz Model", bty = "n")

curve((x0*exp(r*(1-exp(-alpha*x))/alpha)), lwd = 2, col = 3, add = T)

fit = nls(x ~ GompertzModel(t, x0, r, alpha), data = data, start = list(x0 = 5, r = 0.5, alpha = 0.4))

lines(t, fitted.values(fit), col = 2, lwd = 2)

legend("bottomright", legend = c("Estimated", "True"), col = c(2, 3), lwd = c(2,2), bty = "n")

Bootstrapping nonlinear regression model

Note that we have already computed the bootstrap estimates of the parameters of the linear regression model. The same thing can be done for the nonlinear regression model as well. First we show it by explicitly writing the code, and then we shall see the use of special packages in R to obtain the bootstrap distribution of the parameters of nonlinear models. We take the logistic model as a case study; it can easily be modified for other models.

LogisticModel = function(t, x0, r, K){

  K/(1+(K/x0 - 1)*exp(-r*t))

}

control = nls.control(maxiter = 500)

x0 = 5; K = 20; r = 0.8

t = seq(0, 15, by = 0.3)

x = LogisticModel(t, x0 = 5, r = 0.8, K = 20) + rnorm(length(t), 0, 2)

data = data.frame(t, x)

B = 500

boot.x0 = numeric(B); boot.r = numeric(B); boot.K = numeric(B)

for(i in 1:B){

  brow = sample(1:nrow(data), replace = T)

  newdata = data[brow,]

  fit = nls(x~LogisticModel(t, x0, r, K), control=control, 

            data = newdata, start = list(x0 = 5, K = 20, r = 0.8))

  boot.x0[i] = coef(fit)[1]

  boot.K[i] = coef(fit)[2]

  boot.r[i] = coef(fit)[3]

}

par(mfrow=c(3,3))

hist(boot.x0, probability = T, col = "grey", main = "histogram of x0")

plot(boot.x0, boot.K, type = "p", col = "red")

plot(boot.x0, boot.r, type = "p", col = "red")

plot(boot.K, boot.x0, type = "p", col = "red")

hist(boot.K, probability = T, col="grey", main = "histogram of K")

plot(boot.K, boot.r, type = "p", col="red")

plot(boot.r, boot.x0, type = "p", col="red")

plot(boot.r, boot.K, type = "p", col="red")

hist(boot.r, probability = T, col="grey", main = "histogram of r")
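As mentioned above, packaged routines can carry out the same resampling. A minimal sketch with the boot package is given below (it reuses LogisticModel, control and data defined above; note that nls() may fail to converge for some bootstrap samples, in which case different starting values or a different seed may be needed; the nlstools package also offers an nlsBoot() function for this purpose):

library(boot)
boot.fn = function(d, index){
  fit = nls(x ~ LogisticModel(t, x0, r, K), data = d[index, ], control = control,
            start = list(x0 = 5, K = 20, r = 0.8))
  coef(fit)                                  # return the parameter estimates for this resample
}
boot.out = boot(data, boot.fn, R = 500)      # 500 bootstrap resamples of the rows
boot.out                                     # bootstrap standard errors of x0, K and r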

Note: nls.control() can be used to control the estimation algorithm used by nls function. Type nls.control() in the console and check.

Note: Available methods for extracting various quantities from an nls() model fit.

a) anova      Comparison of several nested model fits

b) coef       Parameter estimates from model fit

c) confint    Profile likelihood based confidence intervals from MASS package

d) fitted     The prediction based on the fitted model on the used data set

e) residuals  Residuals from the model fit

f) summary    Summary of the output

g) vcov       Variance-covariance matrix of the parameter estimates.
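A short illustration of these extractors, assuming fit is an nls() logistic fit as obtained above (confint() profiles the likelihood and may take a moment):

coef(fit)        # parameter estimates
vcov(fit)        # variance-covariance matrix of the estimates
confint(fit)     # profile likelihood confidence intervals
fitted(fit)      # fitted values at the observed time points
residuals(fit)   # residuals from the fit
summary(fit)     # estimates, standard errors and residual standard error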

Real Data Analysis

Data were collected on length of fish, Cirrhinus mrigala, at 12 consecutive time points for each of the four equispaced directions to be referred to as A, B, C and D, emanating at 45◦ from the center of the lake to its four corners. At each time point, 12 measurements were available. Fishes were combined in the hoop nets placed at an equal radial distance from the center in each of these directions.

setwd("Folder Path")  #Set working directory.

#LOCATION A

data.A = read.table("LocationA.txt", header = FALSE, sep = "\t", na.strings = "NA") 

time = 1:12

class(data.A)

dim(data.A)

matplot(time, t(data.A), type = "p", col = 1, ylab = "Observed length (cm)", xlab = "Time (week)", main = "Location A")

x = colMeans(data.A)

t = 1:12

data = data.frame(t, x)

fit = nls(x ~ ExpModel(t, x0, r), data = data, start = list(x0 = x[1], r = 0.7))

summary(fit)

lines(t, fitted.values(fit), col = "orange", lwd = 2)

resid(fit)

mse = mean(resid(fit)^2)

fit1 = nls(x ~ LogisticModel(t, x0, r, K), data = data, start = list(x0 = x[1], r = 0.7, K = x[12]) )

summary(fit1)

lines(t, fitted.values(fit1), col = "red", lwd = 2)

resid(fit1)

mse1 = mean(resid(fit1)^2)

mse1

fit2 = nls(x ~ ThetaLogisticModel(t, r, x0, K, theta), data = data, start = list(x0 = x[1], r = 0.4, K = x[12], theta = 1))

summary(fit2)

lines(t, fitted.values(fit2), col = "blue", lwd = 2)

resid(fit2)

mse2 = mean(resid(fit2)^2)

fit3 = nls(x~GompertzModel(t, x0, r, alpha), data = data, start = list(x0 = x[1], r = 0.5, alpha = 0.4))

lines(t, fitted.values(fit3), col = "green", lwd = 2)

resid(fit3)

mse3 = mean(resid(fit3)^2)

legend("topleft", legend = c("Exponential", "Logistic", "Theta-logistic", "Gompertz"), col = c("orange","red", "blue", "green"), lwd = c(2,2,2,2), bty = "n")

mse

mse1

mse2

mse3

#If MSE is considered as the measure of goodness of fit then theta-logistic model is the best model describing the data.

AIC(fit) #Akaike Information Criterion

AIC(fit1)

AIC(fit2)

AIC(fit3)

#A model with lower AIC is preferred to a model with higher AIC.

#If AIC is considered as the measure of goodness of fit then theta-logistic model is the best model describing the data.

summary(fit2) #Summary of the best model.

More conceptual problems:

In some of the simulation exercises above, we have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. Explore this fact using simulated data; a possible simulation sketch is given below. The problem is from James et al. (2013), Chapter 6.
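In the sketch below (illustrative, and not necessarily the exact exercise from the book), only the first 5 of 20 features truly influence the response, the training set is small and the test set is large:

set.seed(1)
n = 1000; p = 20
X = matrix(rnorm(n*p), nrow = n)
beta = c(rep(2, 5), rep(0, p - 5))            # only the first 5 features matter
y = as.vector(X %*% beta + rnorm(n))
d = data.frame(y = y, X)
names(d) = c("y", paste0("x", 1:p))
train = sample(1:n, 100)                      # small training set, large test set
train.mse = test.mse = rep(NA, p)
for(k in 1:p){
  form = as.formula(paste("y ~", paste(names(d)[2:(k + 1)], collapse = " + ")))  # first k features
  fit = lm(form, data = d[train, ])
  train.mse[k] = mean(fit$residuals^2)
  pred = predict(fit, newdata = d[-train, ])
  test.mse[k] = mean((d$y[-train] - pred)^2)
}
plot(1:p, train.mse, type = "b", col = "red", ylim = range(train.mse, test.mse),
     xlab = "Number of features", ylab = "MSE")
lines(1:p, test.mse, type = "b", col = "blue")
legend("topright", legend = c("train MSE", "test MSE"), col = c("red", "blue"), lwd = 2, bty = "n")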

    

More conceptual problems:

You are expected to run the following codes. Some nice graphics are also there.

Auto = read.table("http://www-bcf.usc.edu/~gareth/ISL/Auto.data", head=T)

library(ISLR)

data(Auto)

names(Auto)

head(Auto)[,1:4]

dim(Auto)

tail(Auto)[,1:4]

class(Auto)

Auto = na.omit(Auto)        # simply removes the rows which have missing values

dim(Auto)

summary(Auto)               # to get the descriptive summary for each of the variables in the Auto dataset

plot(Auto$cylinders , Auto$mpg )

attach(Auto)                # See the help file for attach() function

plot(cylinders, mpg)

plot(cylinders, mpg, main="cylinders and mpg", xlab="number of cylinders", ylab="miles per gallon", col="red", lwd=4, cex.lab=2, cex.main = 3 )


The cylinders variable is stored as a numeric vector, so R has treated it as quantitative. However, there are only a small number of possible values for cylinders, and one may prefer to treat it as a qualitative variable.

cylinders = as.factor(cylinders)

class(cylinders)

In real data, categorical variables are often stored as numbers (Married = 0, Unmarried = 1; High = 2, Medium = 1, Low = 0). The as.factor() function can be useful in these cases. Let us plot again. When the variable plotted on the x-axis is categorical, plot() automatically produces boxplots. As usual, a number of options can be specified in order to customize the plots.

plot(cylinders, mpg)

plot(cylinders , mpg , col ="red ")

plot(cylinders, mpg ,col ="grey" , varwidth =FALSE, horizontal =T)

plot(cylinders , mpg , col ="green", varwidth =T)

plot(cylinders, mpg,  col =c("red", "green","blue","grey","yellow"), varwidth =F, xlab=" cylinders ", ylab ="MPG ")

boxplot(mpg~cylinders, data=Auto, col =2:6, xlab="cylinders", ylab ="MPG")

boxplot(mpg~cylinders, data=Auto, col =2:6, notch=TRUE, xlab="cylinders", ylab ="MPG ")

boxplot(mpg, data=Auto, horizontal=TRUE, lwd=3, cex.axis=2, main="box plot of mpg", cex.main=2); box(lwd=3)

text(x=rep(c(13,20,26,37),2), y=c(rep(1.4,4),rep(.7,4)), labels=c(rep("25%",4),1:4), cex=2)

abline(v=quantile(mpg), lty=2, col=2:6, lwd=2.5)

Recall that cylinders is now a categorical variable. We want to draw a barplot of the variable cylinders; table(cylinders) computes the frequencies of each class.

barplot(table(cylinders), cex.axis = 2, cex.names = 2, beside = T)

freq = table(cylinders)     # frequencies of each cylinder class

barplot(freq, col=2:6, main="barplot of cylinders")

hist(mpg, col = 2)

hist(mpg,col=3,prob=T)

The pairs(Auto) function creates a scatterplot matrix, i.e., a scatterplot for every pair of variables. A subset of the variables can also be plotted.

pairs(Auto)

pairs(~mpg + displacement + weight, Auto, col=2)

library(car)

scatterplotMatrix(~mpg+displacement + weight, data=Auto, diagonal = "histogram")

scatterplotMatrix(~mpg+displacement + weight, data=Auto, diagonal = "density")

scatterplotMatrix(~mpg+displacement + weight, data=Auto, diagonal = "boxplot")

help(scatterplotMatrix)

library(UsingR)

norm = rnorm(200)

simple.hist.and.boxplot(norm, main="symmetric")

expo = rexp(200, rate=5)

simple.hist.and.boxplot(expo, main="long tailed")

unif = runif(200, min=-5,max=5)

simple.hist.and.boxplot(unif, main="short tailed")

logis = rlogis(200, 0, scale = 1)

simple.hist.and.boxplot(logis, main="both tail long")

par(mfrow=c(1,2))

simple.violinplot(mpg~ cylinders, data = Auto, col="lightgrey")

box(lwd=3)

simple.densityplot(mpg~ cylinders, data=Auto, col=2, lwd=2)

box(lwd=3)

plot(horsepower, mpg)

identify(horsepower, mpg, name)                    # Interactive plot in R. Chosing points by name

cyl.freq = table(cylinders)

pie(cyl.freq)

names = c("cylinder = 8 \n", "cylinder = 4 \n", "cylinder = 6 \n", "cylinder = 3 \n", "cylinder = 5 \n")

colors = c("red","green","blue", "yellow", "grey")

pie(cyl.freq, col=colors, labels =names)

pct = round(cyl.freq/sum(cyl.freq)*100)

pct = paste("(", pct, "%",")", sep=" ")

pctnames = paste(names, pct )

pie(cyl.freq, col=rainbow(length(cyl.freq)), labels =pctnames)

cyl.freq = table(cylinders)

names = c("   cylinder = 8 \n", "cylinder = 4 \n", "cylinder = 6 \n    ", "cylinder = 3 \n   ", "    cylinder = 5 \n")

colors = c("red","green","blue", "yellow", "grey")

pie(cyl.freq, col=colors, labels =names)

pct = round(cyl.freq/sum(cyl.freq)*100)

pct = paste("(", pct, "%",")", sep=" ")

pctnames = paste(names, pct )

pie(cyl.freq, col=colors, labels =pctnames)

library(plotrix)

pie3D(cyl.freq,labels=pctnames,explode=0.1, main="Pie Chart of cylinders ", labelrad = 1.6)

library(lattice)

p1 = histogram(~mpg | factor(cylinders), data=Auto)

print(p1)

# Surface plots

x = seq(from = -2, to= 2, by = 0.1)

y = seq(from = -1, to = 1, by = 0.1)

model = function(x,y){

  3+x+y

}

z=outer(x,y,model)

persp(x,y,z, theta=50, col="red")

model = function(x, y){

  3+3*x^2+.1*y^3

}

z = outer(x, y, model)

persp(x,y, z, theta=50, ticktype="detailed", col="red")

persp(volcano,col="grey")

contour(volcano)

library(vioplot)

vioplot(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width, col="grey")

library(lattice)

wireframe(volcano, shade = TRUE, aspect = c(61/87, 0.4), light.source =c(10,0,10))

library(lattice)

cloud(Sepal.Length ~ Petal.Length * Petal.Width | Species, data = iris, screen =list(x = -90, y = 70), distance = .4, zoom = .6)

cloud(Sepal.Length ~ Petal.Length * Petal.Width, data = iris, screen = list(x =-90, y = 70), distance = .4, zoom = .6)

library(playwith)

library(zoo)

playwith(xyplot(sunspots ~ yearmon(time(sunspots)), xlim = c(1900, 1930), type ="l"), time.mode = TRUE)

Three sessions (4 hours each) on R Programming during Software Lab-II (Semester-III, 2017):

setwd("C:/Users/admin/Desktop/RLecture")

getwd()

#Simulating random numbers following some distribution

#Generating uniform random numbers

set.seed(1) #Generate same set of random numbers

x = runif(n = 100, min=0, max = 1)

print(x)

x = runif(100); print(x)

x = runif(100, 1, 10); print(x)

#Some important functions

range(x)

max(x)

min(x)

mean(x)

sd(x)

is.na(x)

sum(is.na(x))

summary(x)

#Histogram

set.seed(1)

x = runif(10000, min = 1, max = 5)

hist(x, probability = T)

abline(h = 1/(5-1), lwd =2, col = "red")

#lwd is used to fix the width of the line

#Multiple plots in one plotting window

par(mfrow = c(2,2))

n = c(10, 100, 1000, 10000)

for(i in 1:length(n)){

  set.seed(1)

  x = runif(n[i], min = 1, max = 5)

  print(summary(x))

  hist(x, probability = T)

  abline(h = 1/(5-1), lwd =2, col = "red")

}

#Generating random numbers from exponential distribution

par(mfrow = c(1,1))

lambda = 2

curve(lambda*exp(-lambda * x), from = 0, to=5 )

lambda = 1

curve(lambda*exp(-lambda * x), from = 0, to=5, ylab = expression(lambda *exp(-lambda*x)), 

      lwd =2, col = "red", ylim=c(0,2), main = "Exponential density function")

lambda = 2

curve(lambda*exp(-lambda * x), from = 0, to=5, ylab = expression(lambda *exp(-lambda*x)), 

      lwd =2, col = "blue", add = T)

lambda = 3

curve(lambda*exp(-lambda * x), from = 0, to=5, ylab = expression(lambda *exp(-lambda*x)), 

      lwd =2, col = "green", add = T)

legend = c(expression(paste(lambda," = 1")), expression(paste(lambda," = 2")), expression(paste(lambda," = 3")))

legend("topright", legend= legend, bty="n", lwd = c(2,2,2), col = c("red","blue", "green"))

# Checking for density function

lambda = 2

f = function(x){

  lambda*exp(-lambda*x)

}

I = integrate(f, lower = 0, upper = Inf)

print(I)

str(I)  # What are information available in I

print(I$value)

# Expected value

fx = function(x){

  x*f(x)

}

I1 = integrate(fx, lower = 0, upper = Inf)

print(I1)

fx2 = function(x){

  x^2*f(x)

}

I2 = integrate(fx2, lower = 0, upper = Inf)

print(I2)

fvar = I2$value - I1$value^2

print(fvar)

set.seed(1)

x = rexp(100)

print(x)

hist(x, probability = T)

lambda = 1

curve(lambda*exp(-lambda * x), from = 0, to=5, ylab = expression(lambda *exp(-lambda*x)), lwd =2, col = "red", add =T)

#Generating random numbers from normal distribution

mu = 0

sigma = 1

f = function(x){

  (1/sqrt(2*pi*sigma^2))*exp(-(x-mu)^2/(2*sigma^2))

}

I = integrate(f, -Inf, Inf)

print(I)

curve(f(x), -4, 4, lwd = 2, col = 2, main = expression(frac(1,sqrt(2*pi*sigma^2))*e^(-frac((x-mu)^2,2*sigma^2))))

legend(1.5, 0.4,legend=expression(paste(mu,"=0, ", sigma, "=1")), bty = "n", col = 2, lwd = 2)

set.seed(1)

x = rnorm(10000)

hist(x, probability = T, xlim = c(-3,3), ylim = c(0, 0.5))

curve(f(x), -3, 3, add = T, col = "blue")

mean(x)

sd(x)

#--------------------------------------------------------------------------------------

Central limit theorem.

#--------------------------------------------------------------------------------------

x1 = rexp(1000)            # Generate 1000 random numbers from exp(1) distribution

x2 = rexp(1000)

x3 = rexp(1000)

x4 = rexp(1000)

hist(x1, prob = T)         # Histograms of x2, x3, ... will be similar, hence omitted

par(mfrow = c(2,2))        # Multiple plots in a single plot window

y1 = x1

hist(y1, prob = T)         # Histogram of Y_1

y2 = (x1 + x2)/2

hist(y2, prob = T)         # Histogram of Y_2

y3 = (x1+x2+x3)/3

hist(y3, prob = T)         # Histogram of Y_3

y4 = (x1+x2+x3+x4)/4

hist(y4, prob = T)         # Histogram of Y_4

Function_CLT = function(n){

  set.seed(1)

  x = matrix(NA, nrow = n, ncol = 100)

  for(i in 1:n){

    x[i,] = rexp(100)

  }

  y = matrix(NA, nrow = n, ncol = 100)

  for(i in 1:n){

    for(j in 1:100){

      y[i,j] = sum(x[1:i,j])/i

    }

  }

  par(mfrow = c(3,3))

  for(i in 1:n)

    hist(y[i,], prob = T, col= "lightgrey", main = paste("n=",i), xlab = expression(paste(Y)))

}

Function_CLT(9)

# Extend the above function to include the user input for the choice of distribution of X1, dist =

# "uniform", "binomial", "poisson", "exponential".

# Simulating the data from the normal distribution. You must play with the functions rnorm(), runif(), rexp() etc.

#Normal distribution

x = rnorm(100)

hist(x, prob = T)

# Distribution function of normal random variable

y = seq(-4, 4, length.out = 1000)

pnorm(0)

z = pnorm(y)

z

plot(y, z, ylab = expression(P(X<=y)))

# Density function of normal random variable

y = seq(-4, 4, length.out = 1000)

dnorm(0)

z = dnorm(y)

z

plot(y, z, ylab = expression(f[X](x)), xlab = expression(x))

# Overlay density function and histogram

x = rnorm(100)

hist(x, prob = T)

lines(sort(x), dnorm(sort(x)), col = "red", lwd = 2)

# Quantile function

qnorm(0.5)

# Demonstration for exponential distribution

x = rexp(100)

hist(x, prob = T)

y = seq(0,5, length.out = 100)

lines(y, dexp(y), col = "red", lwd = 2)

plot(y, pexp(y), main = "Distribution function of X~exp(1) ")

# Simple linear Regression

b0 = 1; b1 = 2                    # Fix the population regression coefficients

x = seq(1, 10, length.out = 30)   # Input variables or the explanatory variable

y = b0 + b1* x + rnorm(length(x)) # Generate data values for the response variable

plot(x, y, col = "red")           # The scatter plot

data = data.frame(x = x, y = y)

summary(data)

fit = lm(y ~ x, data = data)

summary(fit)

abline(fit, lwd =2, col = "blue")

legend("topleft",legend=expression(paste("y=", 1.19 + 1.95*x)), bty = "n")

par(mfrow = c(2,2))

plot(fit)

residuals = y - fitted.values(fit)

plot(fitted.values(fit), residuals, xlab = "fitted values")

abline(h =0, col = "red", lwd = 2)

hist(residuals, prob = T)

std_e = as.numeric(scale(residuals))

hist(std_e,prob = T)

str(fit)

fit$coefficients

fit$residuals

fit$fitted.values

#-----------------------------------------------------------------------------------------

Regression with more than one explanatory variable. The function lm() also allows the use of multiple input/explanatory variables.

#-----------------------------------------------------------------------------------------

library(MASS)

data("Boston")

View(Boston)

fix(Boston)

edit(Boston)

help(Boston)

class(Boston)

names(Boston)

head(Boston, n =10)

tail(Boston)

summary(Boston)

dim(Boston)

sum(is.na(Boston))

pairs(Boston[,1:6])

fit = lm(medv ~ dis + nox + crim, data = Boston)

summary(fit)

#----------------------------------------------------------------------------------------------

In the following section, we shall discuss the fitting of nonlinear mathematical functions to data. Some equations can be converted to a linear form, after which the linear regression model can be applied.

#------------------------------------------------------------------------------------------------

set.seed(12345)

x = seq(0, 5, length.out = 50)

b0 = 0.8; b1 = 0.1            # Fix the value of the population parameters

y = b0*exp(b1*sqrt(x) + rnorm(length(x), mean = 0, sd = 0.01)) # Generate the response variable

plot(x, y, lwd = 2, col = "red")

d = data.frame(x=x, y= y )

#--------------------------------------------------------------------------------

Many nonlinear functions can be converted to a linear function by taking some suitable transformation of the regressor and the response variables. Let us take the following transformation.  

#-------------------------------------------------------------------------------

Y = log(y)                  # Transformed response variable 

X = sqrt(x)                 # Transformed regressor variable

data = data.frame(X, Y)

fit = lm(Y~X, data = data)  # Fit simple linear regression to the transformed data

summary(fit)                # Look at the summary of the fit

B0 = coef(fit)[1]           # Estimate of the parameters for the transformed model (linear)

B1 = coef(fit)[2]

plot(X, Y, col = "red", lwd = 2)

abline(fit, col = "blue", lwd =2)

                            # Let us now get back to the nonlinear equation and see the fitting

b0_hat = exp(B0);           #Retrieve the estimates of the parameters of the nonlinear model

print(b0_hat)

b1_hat = B1; 

print(b1_hat)

# Let us now plot the nonlinear equation to the scatter plot of the original variables

main = expression(paste(y, " = ",b[0]*e^(b[1]*sqrt(x)) ))

plot(x, y, lwd = 2, col = "red", main = main)

curve(b0_hat*exp(b1_hat*sqrt(x)), 0, 5, lwd = 2, col = "blue", add = TRUE)

#-------------------------------------------------------------------------------------

The nonlinear regression function may not be convertible to a linear regression problem by any suitable mathematical transformation. In that case we have to use a numerical routine to estimate the parameters of the nonlinear model. This can be done in R using the nls() function. The estimates are obtained by an iterative numerical method (by default nls() uses a Gauss-Newton type algorithm); refer to the discussion in class for the score equations and the Hessian matrix.

#-----------------------------------------------------------------------------------

fit = nls(formula = y~ b0*exp(b1*sqrt(x)), data = d, start = list(b0 = 0.7, b1 = 0.13))

summary(fit)

yhat = fitted.values(fit)

plot(x,y, lwd = 2, col = "red")

lines(x, yhat, col = "blue", lwd = 2)

# Investigate the residuals

plot(yhat, residuals(fit), lwd = 2)

e = residuals(fit)

e_std = (e-mean(e))/sd(e)

qqnorm(e_std)

qqline(e_std, col ="red", lwd = 2)

#-------------------------------------------------------------------------------------

# nls() allows the user to deal with heteroscedasticity, i.e., the situation where the

# variance of the errors is not the same at all input levels. In nls(), try changing the option weights =

# Refer to the discussion in class for choosing the weights appropriately

#--------------------------------------------------------------------------------------

fit = nls(formula = y~ b0*exp(b1*sqrt(x)), data = d, start = list(b0 = 0.7, b1 = 0.13), weights = x^(1/5))

plot(fitted.values(fit), residuals(fit))

#----------------------------------------------------------

# Exercise 8, ISLR, CHAPTER-II

#----------------------------------------------------------

setwd("C:/Users/admin/Desktop/RLecture07102017")

college = read.csv("College.csv")

names(college)

univ.names = college$X

print(univ.names)

college = college[,-1]

row.names(college) = univ.names

View(college)

summary(college)

pairs(college[,1:5])

# boxplot

head(college[,c("Private", "Outstate")])

attach(college)

plot(Private, Outstate, lwd = 2, col = "lightgrey", 

     main = "Boxplot of Outstate versus Private")

Elite = rep("No", nrow(college))

Elite[college$Top10perc > 50] = "Yes"

college = data.frame(college, Elite)

names(college)

table(Elite)

plot(college$Elite, Outstate, col = "red", lwd =2, main = "Boxplot of Outstate versus Elite")

par(mfrow= c(2,2))

hist(Outstate, prob = T, col = "lightgrey")

hist(Grad.Rate, prob = T, col = "lightyellow")

hist(S.F.Ratio, prob = T, col =  "green")

hist(Books, prob = T, col = "blue")

library(ISLR)

data("Auto")

dim(Auto)

names(Auto)

complete.cases(Auto)

sum(complete.cases(Auto))

x = numeric(length = length(names(Auto))); length(x)

x = numeric(dim(Auto)[2]); length(x)

for(i in 1:length(x)){

  x[i] = ifelse(is.numeric(Auto[,i]), 1, 0)

}

sum(x)

sapply(Auto, FUN = is.numeric)

# Range of the quantitative variables

which(x==0)

sapply(Auto[,-c(which(x==0))], FUN = range)

sapply(Auto[,-c(which(x==0))], FUN = mean)

sapply(Auto[,-c(which(x==0))], FUN = sd)

newAuto = Auto[-c(10:85),]

sapply(newAuto[,-c(which(x==0))], FUN = range)

sapply(newAuto[,-c(which(x==0))], FUN = mean)

sapply(newAuto[,-c(which(x==0))], FUN = sd)

table(Auto$cylinders)

table(Auto$origin)

# We treat them as factor variables because number of unique values for origin is 3 and for cylinders is 5.

Auto$cylinders = as.factor(Auto$cylinders)

Auto$origin = as.factor(Auto$origin)

# Following boxplots may be informative

plot(Auto$cylinders, Auto$mpg, col = "lightgrey", pch =20, cex=1, ylab = "mpg", xlab = "Cylinders")

plot(Auto$cylinders, Auto$displacement, col = "lightgrey", pch =20, ylab = "displacement", xlab = "Cylinders")

plot(Auto$cylinders, Auto$horsepower, col = "lightgrey", pch =20, ylab = "horsepower", xlab = "Cylinders")

plot(Auto$cylinders, Auto$weight, col = "lightgrey", pch =20, ylab = "weight", xlab = "Cylinders")

plot(Auto$origin, Auto$mpg, col = "lightgrey", pch =20, cex=1, ylab = "mpg", xlab = "origin")

plot(Auto$origin, Auto$displacement, col = "lightgrey", pch =20, ylab = "displacement", xlab = "origin")

plot(Auto$origin, Auto$horsepower, col = "lightgrey", pch =20, ylab = "horsepower", xlab = "origin")

plot(Auto$origin, Auto$weight, col = "lightgrey", pch =20, ylab = "weight", xlab = "origin")

pairs(Auto)

#Scatterplot suggests that the quantitative variable year does not seem important

pairs(Auto[,c("mpg", "displacement","horsepower","weight", "acceleration")])

# The predictors weight and horsepower seem to be linearly related

# horsepower and displacement seem to be linearly related

# mpg seems to be nonlinearly related to displacement, horsepower and weight

# acceleration does not seem to impact mpg

# There seems to be a relationship between horsepower and acceleration

# There seems to be a relationship between displacement and acceleration

# Weight and acceleration do not seem to impact each other

# Analyzing the correlation matrix visually

cor(Auto[,c("mpg", "displacement","horsepower","weight", "acceleration")])

library(corrplot)

d = Auto[,c("mpg", "displacement","horsepower","weight", "acceleration")]

class(d)

M = cor(as.matrix(d))

print(M)

par(mfrow= c(1,1))

corrplot(M)

corrplot(M, method = c("number"))

corrplot(M, method = "ellipse", type="upper")

help("corrplot")

# We comment that the horsepower and displacement are highly correlated (cor = 0.9)

# We comment that the weight and displacement are highly correlated (cor = 0.93)

# mpg is negatively correlated with displacement (r=-0.81), horsepower(-0.78), weight(-0.83) 

# The plots suggest that displacement, horsepower and weight may be used to predict mpg.

fit = lm(mpg~ displacement + horsepower + weight, data = Auto)

summary(fit)

# Let us directly apply the formula from multiple linear regression and tally them with the output of lm() function

n = nrow(Auto)

p = 3

X = cbind(rep(1, n), Auto$displacement, Auto$horsepower, Auto$weight)

head(X)

Y = Auto$mpg

beta_hat = numeric(p+1)

beta_hat = solve(t(X)%*%X)%*%t(X)%*%Y

beta_hat = as.vector(beta_hat)

abs(beta_hat-coef(fit))<10^(-6)

error = Y - X%*%beta_hat

residuals(fit)

plot(error, residuals(fit))

var_error  = sum(error^2)/(n-p-1)

vcov_beta_hat = var_error*solve(t(X)%*%X) 

vcov_beta_hat

var_beta_hat = diag(vcov_beta_hat)

sqrt(var_beta_hat)

summary(fit)

#----------------------------------------------

###################################

# Section for best subset selection

###################################

setwd("C:/Users/admin/Desktop/RLecture07102017")

getwd()

Boston = read.csv(file = "boston.csv")

names(Boston)

names(Boston) = c("crimeRate","zone","indus","charles","nox","rooms","age",

          "distances","highways","tax","teacherRatio","color","status","cost")

length(names(Boston))

dim(Boston)

head(Boston)

summary(Boston)

fit = lm(cost~., data = Boston)

summary(fit)

library(leaps)

fit = regsubsets(cost~., data = Boston, nvmax = 13, method = "forward")   # forward stepwise selection; use method = "exhaustive" for best subset

summary(fit)

bestfit = summary(fit)

plot(bestfit$rsq, xlab = "Number of predictors",  col="grey",

     ylab = "R-square",pch=20, lwd=2, type="l", main = "Best fit R-square versus Number of predictors")

ind = which.max(bestfit$rsq)

points(ind, bestfit$rsq[ind], col = "red", lwd=3, 

       pch=20, cex =2)

plot(bestfit$adjr2, xlab = "Number of predictors", col="grey",

     ylab = "Adjusted R-square",pch=20, lwd=2, type="l", main= "Best fit Adj R^2 versus Number of predictors")

ind = which.max(bestfit$adjr2)

points(ind, bestfit$adjr2[ind], col = "red", lwd=3, 

       pch=20, cex =2)

plot(bestfit$cp, xlab = "Number of predictors", type="l", 

     ylab = "Mallow's Cp",pch=20, lwd=2, col ="grey", 

     main = "Best fit Cp versus Number of predictors")

ind = which.min(bestfit$cp)

points(ind, bestfit$cp[ind], col = "red", lwd=3, pch=20, cex =2)

plot(bestfit$rss, xlab = "Numbero of predictors", ylab ="RSS",

     type="l", col ="grey", lwd=2,

     main = "Best fit RSS versus Number of predictors")

ind = which.min(bestfit$rss)

points(ind, bestfit$rss[ind], col="red", lwd=2, pch=20)

coef(fit,1)

coef(fit,2)

coef(fit,10)

coef(fit,11)

coef(fit, ind )

# Visual representation

plot(fit, scale = "r2")

plot(fit, scale = "adjr2")

# Polynomial regression and best fitting

x = seq(1, 5, length.out = 100)

y = 2 + 1.5*x^2 + rnorm(length(x), 0, 4.5)

d = data.frame(x,y)

plot(x,y)

# Fitting a polynomial regression with high noise data

fit = lm(y~ x + I(x^2), data = d)

summary(fit)

curve(coef(fit)[1] + coef(fit)[2]*x + coef(fit)[3]*x^2, 0,5, 

      add= TRUE, col = "red", lwd=2 )

# Fitting a polynomial regression with low noise data

x = seq(1, 5, length.out = 100)

y = 2 + 1.5*x^2 + rnorm(length(x), 0, 2.0)

d = data.frame(x,y)

plot(x,y)

fit = lm(y~ x + I(x^2), data = d)

summary(fit)

curve(coef(fit)[1] + coef(fit)[2]*x + coef(fit)[3]*x^2, 0,5, 

      add= TRUE, col = "red", lwd=2 )

library(ISLR)

fit = lm(y~poly(x, 2, raw = TRUE), data = d)

summary(fit)

# Train test split

x = seq(1,5, length.out = 100)

y = 2 + 1.5*x^2 + rnorm(length(x), 0, sd = 6.0)

d = data.frame(y,x)

plot(x,y)

train = sort(sample(1:nrow(d), size = ceiling(nrow(d)*0.6), replace = FALSE))

train

train_data = d[train,]

test_data = d[-train,]

deg = 1:10

train_mse = numeric(length = length(deg))

test_mse = numeric(length = length(deg))

for(i in deg){

  fit = lm(y~poly(x, i, raw = TRUE), data = train_data)

  train_mse[i] = sum((fit$residuals)^2)/length(train)

  pred_y = predict(fit, newdata = test_data)

  test_mse[i] = sum((test_data$y - pred_y)^2)/nrow(test_data)

}

plot(train_mse, col= "red", type="l", lwd=2, ylab= "MSE", 

     xlab="degree of polynomial", ylim=c(0,50))

lines(test_mse, col= "blue", lwd=2)

legend("topright", legend =c("train mse", "test mse"), 

       col=c("red","blue"), lwd=c(2,2), bty="n")

# Repeating the process multiple times

run = 40

deg = 1:10

train_mse = matrix(NA, nrow= run, ncol = length(deg)) 

test_mse = matrix(NA, nrow= run, ncol = length(deg))

for(i in 1:run){

  train = sort(sample(1:nrow(d), size = ceiling(nrow(d)*0.8), replace = FALSE))

  train

  train_data = d[train,]

  test_data = d[-train,]

  

  for(j in deg){

    fit = lm(y~poly(x, j, raw = TRUE), data = train_data)

    train_mse[i,j] = sum((fit$residuals)^2)/length(train)

    pred_y = predict(fit, newdata = test_data)

    test_mse[i,j] = sum((test_data$y - pred_y)^2)/nrow(test_data)

  }

  

}

ylim = c(min(train_mse),max(test_mse))

avg_train_mse = colMeans(train_mse)

avg_test_mse = colMeans(test_mse)

plot(avg_train_mse, col = "red", type="l", lwd=2, ylim=ylim)

lines(avg_test_mse, col= "blue", type="l", lwd=2)

legend("topright", legend=c("train mse", "test mse"), 

       lwd=c(2,2), col = c("red","blue"), bty="n")

ind= which.min(avg_test_mse)

points(ind, avg_test_mse[ind], pch=4, lwd=2, col= "green")

This is an M.Sc. Lab course entitled "Bioinformatics and Statistics Lab - I". This lab is mainly designed to implement different techniques of Statistics in solving problems related to the Biological Sciences. The course is also open to the Ph.D. students of this Institute. I have designed the course to cover the following topics. The software R will be used extensively throughout the lab sessions. Particular emphasis will be given to writing functions in R rather than to theory. The RStudio platform will be used extensively to explore its extended capabilities. R is very advanced in graphics; however, in this lab you are encouraged to make use of the basic graphics functions to prepare plots. This will increase your expertise in R.

References: 

James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer, New York.

Acknowledgement: My project student Steffi D'Souza (M.Sc. 2014-16) contributed significantly to the resampling methods section. I am really thankful for her involvement.

Introduction to R Programming. 

R was begun in the 1990s by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland. Nearly 20 senior statisticians form the core development group of the R language, including John Chambers of Bell Labs, the primary developer of the original S language. R provides an effective data handling and storage facility.

It contains a large, coherent, integrated collection of intermediate tools for data analysis. The software is FREE and can be downloaded from http://cran.r-project.org/mirrors.html. Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. Currently, the CRAN package repository features 10964 available packages (https://cran.r-project.org/web/packages/). To install packages, type install.packages(pkgs, dependencies=TRUE) in the console.

You are requested to copy the following codes into a new R script; the outputs are self-explanatory. I will write comments wherever necessary.

x=1:10; print(x)

x=numeric(10); x

x=seq(from=1, to=10, by=2); x; help(seq)

x = seq(from = 1, to = 10, length.out = 100)

x=rep(0,10); x

x=rep(c(1,2),10); x

x=c(3,5,7,3,5,7.5,8,2)               #c stands for concatenation

sum(x)

min(x)

max(x)

cumsum(x)

cumprod(x)

which.max(x)

which.min(x)

mean(x)

sd(x)

x>5

sum(x>5)                             #number of elements which are greater than 5

rev(x)

length(x)

sort(x, decreasing = TRUE ); help(sort)

x[x>5]                               #gives the elements which are greater than 5

sum(x[x>5]) 

all(x>0)

any(x>0)

any(x<0)

y = c(1.3,2.5, 4/7,9, 11, 22/7)

x[3]

y[3:6]

x+y

x-y

x*y

x/y

x=c(3,5,7,3,5,7.5,8,2,NA)

sum(x)

is.na(x)                         #gives missing value is there or not

sum(is.na(x))                    #number of missing values

sum(!is.na(x))

x=x[!is.na(x)]                   #only non missing values are stored into x

square.x=x^2

square.x

sum(x^2)

class(x)

class(A); class(C)       # A and C are matrices created in the Matrices section below; run this after defining them

range(x)

summary(x)

#Matrices in R

A=matrix(c(1,2,3,4), nrow=2, ncol=2);A

B=matrix(c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE);B  #Note the difference made by byrow=TRUE

C=matrix(c("a","b","c","d"),nrow=2, ncol=2);C

D=matrix(c(1,"a",2,"b"), nrow=2, ncol=2);D          

#Note that the matrix D is generated with both numbers and characters. But if you print D you will find that all the entries are displayed as characters. A matrix can hold only one type of data.

A=matrix(c(rep(1,4), rep(2,4), rep(3,4)), nrow=3, ncol=4, byrow=TRUE);A

rownames(A)=c("r1","r2","r3")

colnames(A)=c("c1","c2","c3","c4"); A

rowMeans(A); colMeans(A)

new_A=cbind(A,rowMeans(A)); new_A

colnames(new_A)=c(colnames(A),"Row Means")

new_A=rbind(new_A,c(colMeans(A),sum(A)))

summary(A); help(summary)

#Operations with Matrices

A=matrix(c(1,2,3,4), nrow=2, ncol=2);A

B = matrix(c(4,5,6,8), nrow = 2, ncol = 2); B

A*B        # Elementwise multiplication

A%*%B      # Matrix multiplication

diag(A)    # diagonal entries of the matrix

x = c(1,2,3,5)

diag(x)    # square matrix with diagonal entries with x

b = c(1,10)

solve(A, b)# Solve the linear equation of the form Ax = b

solve(A)   # Inverse of the matrix A

t(A)       # Transpose of a matrix

eigen(A); help(eigen)    # Eigenvalues and eigenvectors of A

#DATA frames in R

age     = c(26, 42, 45, 31, 56)

income  = c(100, 140, 160, 92, 200)

married = c(FALSE,TRUE,FALSE,TRUE,TRUE)

name    = c("James","David", "Lisa", "Mark", "Richard")

myData  = data.frame(name, age, income, married)

class(myData); dim(myData)

names(myData); str(myData)

myData[,"name"]; myData[,1]    # access the columns by name of by index of the column

blood = c("AB+", "A", "B", "O-", "B")    # Adding new variables in your dataframe

newData = cbind(myData, blood)           # cbind stands for column bind; add new rows by rbind

newData

edit(myData)

fix(myData)

data = newData[,-c(5)]        # Deleting a column (here the 5th column, blood) from a dataframe

data

summary(myData)

# Dataframe with missing values. The missing values are indicated by NA in R.

name=c("Bravo", "Anderson", "Gayle", "Gaurav", "Sunil")

age=c(24,33,15,20,NA)

income=c(1000,1070,1500,2500,NA)

marriage=c(TRUE, TRUE, FALSE, TRUE,NA)

mydata=data.frame(name, age, income, marriage)

mydata

summary(mydata); head(mydata,3)

mydata[,2]; mydata[,"name"]

mydata[,1]; mydata[2,]

is.na(mydata); sum(is.na(mydata))

complete.cases(mydata)                 #TRUE for the rows with no missing values

mydata[complete.cases(mydata),]        #gives all rows which have complete information

mydata[!complete.cases(mydata),]

newdata=na.omit(mydata)                #removes the rows which have missing values

newdata; dim(mydata); tail(mydata,2)

#Replacing the missing value

mydata[which(is.na(mydata$age)),"age"]=mean(mydata$age, na.rm=TRUE)   #na.rm=TRUE removes the NA values in the age column before computing the mean

mydata[which(is.na(mydata$income)),"income"]=mean(mydata$income, na.rm=TRUE)

mydata[which(is.na(mydata$marriage)),"marriage"]=TRUE

mydata

#table function

gender = factor(c(rep("female",91),rep("male",92)))

gender

table(gender) 

gender = factor(gender, levels=c("male","female"))

table(gender)

gender = factor(gender, levels=c("Male","female"))

table(gender)

#List. Run the following code and understand the difference with matrix and dataframe

mylist = list(x, A, mydata)

mylist

mylist[[1]]

newlist=list(x, A, mydata, mylist)

newlist[[4]]

x=c(1,2,3,NA,3,5,8,NA,10,2)      #Treating Missing Values


is.na(x); !is.na(x)

sum(is.na(x))                # Number of missing values in the column

sum(!is.na(x))               # Number of non-missing values in the column

#Some topics related to graphs (Exercise 12)

#Pie Charts

states<-c("Punjab","Haryana","Tamil Nadu","Andhra Pradesh", "Uttar Pradesh","Others")

procurement<-c(5486,1248,589,3987,1296,1654)

rice.proc<-data.frame(states,procurement)

rice.proc<-transform(rice.proc,percentage=(rice.proc$procurement/14260)*100)

rice.proc; attach(rice.proc)

pie(procurement,labels=c("Punjab(38.47%)","Haryana(8.75%)","Tamil Nadu(4.13%)", "Andhra Pradesh(27.96%)", "Uttar Pradesh(9.09%)","Others(11.60%)"), main="Procurement of Rice('000 tonnes)\nduring Oct. 1993 to Sept. 1994", col=c("royalblue4","orchid","chocolate4","olivedrab3","orange3", "purple1","violetred4"))

detach(rice.proc)

#Barplot

consumption<-matrix(c(6.93,4.70,0.95,0.56,0.52,5.45,4.84,0.42,0.19,0.15),

nrow=2,ncol=5,byrow=T)

rownames(consumption)<-c("Rural","Urban")

colnames(consumption)<-c("Rice","Wheat","Jowar","Bazra","Maize")

consumption

barplot(consumption,beside=T,main="Consumption of cereals (in kg./month)\n

during 1989-'90",legend.text=rownames(consumption))

Exercise-1. Replace the missing value with the number that appears the maximum number of times.

Exercise-2. Replace the missing values with the mean 
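A possible sketch for Exercises 1 and 2 is given below, reusing the vector x with missing values defined above; for Exercise-1 the most frequent value is obtained with table() and which.max() (ties are resolved by taking the first one).

x = c(1,2,3,NA,3,5,8,NA,10,2)
mode.value = as.numeric(names(which.max(table(x))))   # value appearing the maximum number of times
x1 = x; x1[is.na(x1)] = mode.value; x1                # Exercise-1: replace NA by the most frequent value
x2 = x; x2[is.na(x2)] = mean(x, na.rm = TRUE); x2     # Exercise-2: replace NA by the mean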

all()         # returns TRUE if all values are TRUE

any()         # returns TRUE if any values are TRUE

args()        # information on the arguments to a function

cat()         # prints multiple objects, one after the other

cumprod()     # cumulative product

cumsum()      # cumulative sum

diff()        # form vector of first differences

              # N. B. diff(x) has one less element than x

history()     # displays previous commands used

is.factor()   # returns TRUE if the argument is a factor

is.na()       # returns TRUE if the argument is an NA

              # NB also is.logical(), is.matrix(), etc.

length()      # number of elements in a vector or of a list

ls()          # list names of objects in the workspace

mean()        # mean of the elements of a vector

median()      # median of the elements of a vector

order()       # x[order(x)] sorts x (by default, NAs are last)

print()       # prints a single R object

range()       # minimum and maximum value elements of vector

sort()        # sort elements into order, by default omitting NAs

rev()         # reverse the order of vector elements

str()         # information on an R object

unique()      # form the vector of distinct values

which()       # locates 'TRUE' indices of logical vectors

which.max()   # locates (first) maximum of a numeric vector

which.min()   # locates (first) minimum of a numeric vector

with()        # do computation using columns of specified data frame

#User Defined Function in R

#writing first function

myfun = function(){

  return("Hellow World")

}

myfun()

myfun = function(x){

  #The function returns mean of the input

  return(mean(x))

}

myfun(1:10)

myfun = function(x){

  return(rowSums(x))

}

m = matrix(1:9, nrow = 3, ncol = 3)

myfun(m)

newfun = function(x){

  return(sum(x))

}

newfun(myfun(m))

#function to find root using bisection method

bisection.root = function(f , xmin, xmax, tol = 1e-5){

  a = xmin; b = xmax;

  if(f(a)==0)

    return(cat("The root is ", a))

  if(f(b) == 0)

    return(cat("The root is ", b ))

  

  while (b-a>tol) {

    c = (a+b)/2

    if(f(c) == 0)

      return(cat("The root is ", c ))

    if(f(a)*f(c)<0)

      b = c

    else

      a = c

  }

  return(cat("The root is", (a+b)/2))

}

f = function(x){

  x^2 - 1

}

bisection.root(f , 0.5, 1.7, tol = 1e-5)
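For comparison, base R's uniroot() solves the same root-finding problem; it should agree with bisection.root above.

uniroot(f, interval = c(0.5, 1.7), tol = 1e-5)$root   # built-in root finder applied to the same f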

newton.raphson = function(f, df, x0, tol = 1e-6, iteration = 100){
  if(f(x0) == 0)
    return(x0)
  if(df(x0) == 0){
    print("Give another value for x0")
    return(NULL)
  }
  i = 1
  while(i <= iteration){
    x1 = x0 - f(x0)/df(x0)      # Newton-Raphson update
    e = abs(x1 - x0)            # absolute change in the iterate
    x0 = x1
    if(e < tol)
      return(c(x1, i, e))       # root, number of iterations used, final error
    i = i + 1
  }
  return(c(x1, iteration, e))
}

f = function(x) x^2 - 2
df = function(x) 2*x
newton.raphson(f, df, 0.5)

Compute_mean_sd = function(x){

  print("The Mean and S.D. are:")

  return(c(mean(x),sd(x)))

}

Compute_mean_sd(1:10)

Exercise-3. Modify the function to ask for user input. Type help(scan).

x=scan()

Example-1: Write a function to check whether the roots of a quadratic equation with coefficients a, b and c are real or imaginary.

root.check = function(a,b,c){

  d=b^2-4*a*c

  if(d>0)

    print("roots are real.")

  else if(d<0)

    print("roots are imaginary")

  else

    print("roots are equal")  

}

root.check(1,6,1)

#Modification of the function asking for user input

user.root.check=function(){

  coef=scan()

  if(length(coef)<3){

    print("Three numbers are required.")

    return(NULL)

  }

  else if(length(coef)>3){

    print("Only first three numbers will be used for computation")

  }

  d=coef[2]^2-4*coef[1]*coef[3]

  if(d>0){

    print("roots are real.")

    r1=(-coef[2]+sqrt(d))/(2*coef[1])

    r2=(-coef[2]-sqrt(d))/(2*coef[1])

    return(c(r1,r2))

  }

  else if(d<0){

    print("roots are imaginary.")

    r1=complex(real = -coef[2]/(2*coef[1]), imaginary = sqrt(-d)/(2*coef[1]))

    r2=complex(real = -coef[2]/(2*coef[1]), imaginary = -sqrt(-d)/(2*coef[1]))

    return(c(r1, r2))

  }

  else{

    print("roots are equal")

    r = -coef[2]/(2*coef[1])

    return(r)

  }

}

user.root.check()