Statistics Lab
The following function allows you to visualize the distribution of the sample mean from various distributions. Here n = the number of observations in each sample and m = the number of samples drawn. The central limit theorem says that (under certain conditions), when the sample size is large, the distribution of the sample mean behaves like a normal distribution, whatever the underlying population distribution from which the random samples are drawn. - Modify the following code to accept user inputs for the parameters of the distribution. Try to understand the use of the sapply function. - Plot the normal density on the histogram of the sample mean.
DemoCLT = function(n=50, m=40, distribution="normal"){
if(distribution=="normal")
data = matrix(rnorm(n*m, mean=10,sd=2), nrow=n)
if(distribution == "uniform")
data = matrix(runif(n*m, min=0, max=3), nrow = n)
if(distribution == "beta")
data = matrix(rbeta(n*m, shape1 = 1, shape2=2), nrow=n)
if(distribution == "exponential")
data = matrix(rexp(n*m, rate = 1), nrow=n)
if(distribution == "binomial")
data = matrix(rbinom(n*m, size = 20, prob = 0.5), nrow=n)
colnames(data) = paste("sample", as.character(1:m))
SampleMeans = sapply(colnames(data), function(x) mean(data[,x]))
par(mfrow = c(1,2))
hist(SampleMeans, prob=T, col="lightgrey", main = "Distribution of Sample Mean")
qqnorm(SampleMeans, col="red", lwd=2)
qqline(SampleMeans, col ="blue", lwd=2)
mtext(paste("Distribution of Mean for n = ", as.character(n)), outer = TRUE, cex = 1.5)
result = shapiro.test(SampleMeans)
print(result)
}
DemoCLT(n=1, m=400, distribution = "uniform")
DemoCLT(n=5, m=400, distribution ="uniform")
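As a hint for the density-overlay exercise above, the following two lines (a minimal sketch) can be added inside DemoCLT immediately after the call to hist():
xx = seq(min(SampleMeans), max(SampleMeans), length.out = 200) # grid over the observed range
lines(xx, dnorm(xx, mean = mean(SampleMeans), sd = sd(SampleMeans)), col = "blue", lwd = 2) # overlay the normal density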
#Note the use of the complex class in R below. Various operations on complex numbers are also available. help(complex)
Exercise-4. In a statistical application of your choice, what does a missing value mean? What are the traditional methods of imputing missing values in such an application?
Exercise-5. Create vectors of each of the different primitive types. Create matrices and arrays by attaching dim attributes to those vectors. Look up the help for dimnames and attach dimnames to a matrix with two rows and five columns.
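A possible starting point for Exercise-5 (numeric case only; repeat for the other primitive types):
x = 1:10 # an integer vector
dim(x) = c(2, 5) # attaching a dim attribute turns it into a 2 x 5 matrix
dimnames(x) = list(c("r1","r2"), paste("c", 1:5, sep = "")) # see help(dimnames)
x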
Exercise-6. Write an R function that takes a matrix as input and standardizes each row by subtracting the median and dividing by the MAD (Median Absolute Deviation). You might want to look at the functions mad and median. Next, generalize your function so that these are the default values, but allow the user to select other measures of the center and spread, should they want to use them.
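One possible sketch for Exercise-6; median and mad are the defaults, and any other center/spread functions can be passed in:
row.standardize = function(mat, center = median, spread = mad){
# standardize each row: subtract the center and divide by the spread
t(apply(mat, 1, function(r) (r - center(r))/spread(r)))
}
row.standardize(matrix(rnorm(20), nrow = 4))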
Exercise-7. Write a function that takes $x$ and $n$ as input and returns the value of $$1+\frac{x}{1!} + \frac{x^2}{2!}+\cdots+ \frac{x^n}{n!}.$$ Note that this gives an approximation of the function $e^x$. Now suppose we want to estimate the error in approximating the exponential function with this series on some interval $[a,b]$. Write a program that takes user input for $a$, $b$, $x$ and $n$. Estimate the maximum error in the interval. Can you find a suitable $n$ for which the error is less than some tolerance limit?
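A minimal sketch of the partial sum for Exercise-7 (the error estimation over $[a,b]$ is left as the exercise asks):
exp.series = function(x, n){
sum(x^(0:n)/factorial(0:n)) # 1 + x/1! + x^2/2! + ... + x^n/n!
}
exp.series(1, 10); exp(1) # compare with the built-in exp()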
Some useful functions in R for data analysis:
is.factor(), is.logical(), as.data.frame(), as.numeric() etc.
R can deal with complex numbers also
z = complex(real = 2, imaginary = 3); z
Re(z); Im(z); Mod(z); Arg(z); Conj(z)
Resampling Techniques.
Basic concepts and terminologies
Suppose that we observe a quantitative response Y and p different predictors, X1,X2, · · · ,Xp. We assume that there is some relationship between Y and X = (X1,X2, · · · ,Xp), which can be written in the very general form
Y = f(X) + e
Here f is some fixed but unknown function of X1, X2, · · · , Xp, and e is a random error term which is independent of X and has mean zero. In this formulation, f represents the systematic information that X provides about Y. Statistical learning deals with several approaches to estimating the function f. The reasons for estimating f are twofold, namely “prediction” and “inference”. In many real-life problems the information about the input variables is readily available; however, the output Y is difficult to obtain. In such cases, by estimating f we can predict Y using
Y_hat = f_hat(X)
where f_hat represents the estimate of f and Y_hat represents the resulting prediction of Y. The above equation gives a reasonable prediction since e averages to zero. You are requested to run the following R code and understand the output (refer to the class notes). These will be used again and again in the programs developed for the different resampling techniques.
set.seed(12345) # Ensures the reproducibility of the simulations
x=runif(n=100, min=0, max=1) # Generate 100 numbers from U(0,1) distribution
y=2+3*x^2+rnorm(n=100, mean=0, sd=0.5) # Simulation of response variable, see help(rnorm)
y # Print the response y
plot(x,y) # Scatterplot of x and y
data=data.frame(x,y) # Store data in a dataframe with name "data"
print(data) # Prints the contents of the data
fit=lm(y~x, data=data) # Fit a linear regression y=b_0+b_1x
summary(fit) # Get summary of the fitting exercises, MOST IMPORTANT
abline(fit, col=29, lwd=2) # Add the estimated regression line to the scatterplot
fit=lm(y~x+I(x^2)+I(x^3),data=data) # Fitting a cubic regression equation, carefully check I(x^2)
str(fit) # Useful command to get structure of the class
library(ISLR) # Load the package ISLR (install it first if needed)
fit=lm(y~poly(x,10),data=data) # poly() takes care of polynomial regression of different orders; no need to write I() every time
plot(x,y)
fit=lm(y~x+I(x^2),data=data)
x1=seq(from = 0, to=1, by=0.1)
y1=coef(fit)[1]+coef(fit)[2]*x1+coef(fit)[3]*x1^2
lines(x1,y1, col=3, lwd=2) # Plot the estimated polynomial regression equation
help(curve) # Explore curve function for quick plots
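For instance, a one-line example (sin is just an arbitrary choice of function):
curve(sin(x), from = 0, to = 2*pi, col = "blue", lwd = 2) # quick plot of sin(x) on [0, 2*pi]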
Validation Set Approach
The validation set approach is a simple resampling method for selecting a suitable statistical learning method. In this method, the available set of observations is randomly divided into two parts. One subset is called the training set and the other is called the validation set or hold-out set. The model is fitted to the training data, and the estimated model is used to predict the responses in the validation set. The validation set error rate is assessed using the MSE in the case of a quantitative response. The validation set error rate is nothing but the estimated test error rate. Below we demonstrate the method by a simulation study using R.
library(ISLR)
set.seed(12345)
x = runif(n=100, min = 0, max = 1)
y = 2 + 3 * x^2 + rnorm(n=100, mean = 0, sd = 0.5)
data = data.frame(x,y)
par(mfrow = c(1,2))
deg = 10
run = 10
tmp.train.mse = matrix(NA, nrow = run, ncol = deg)
validation.mse = matrix(NA, nrow = run, ncol = deg)
for (i in 1:run) {
train = sample(1:nrow(data), size = floor(0.5*nrow(data)), replace = FALSE)
train_ = data[train,]
validation_ = data[-train,]
for (j in 1:deg) {
fit = lm(y ~ poly(x, j), data = train_)
tmp.train.mse[i,j] = sum(fit$residuals^2)/nrow(train_)
test.y = predict.lm(fit, newdata = validation_, interval = "prediction")
validation.mse[i,j] = mean((test.y[,1] - validation_$y)^2)
}
if(i == 1)
plot(1:deg, validation.mse[i,], col = i, type = "l", lwd=2, ylim = c(0,.06),
xlab="Degree of Polynomial", ylab = "Mean Squared Error")
else
lines(1:deg, validation.mse[i,], col = i, lwd=2)
}
plot(1:deg, colMeans(validation.mse), type = "b", col = "red", lwd=2,
xlab="Degree of Polynomial", ylab = "Mean Squared Error")
The validation set approach was used on the simulated data set in order to estimate the test error that results from predicting Y using a polynomial function of X. (a) The validation method was repeated ten times, each time using a different random split of the observations into a training set and a validation set; this illustrates the variability in the estimated test MSE that results from this approach (left panel). (b) The validation MSE averaged over the ten splits (right panel). We generated 100 random numbers uniformly distributed on (0,1) and considered them as the observations on the input variable X. The observations for the response variable Y were then generated from the equation Y = 2 + 3X^2 + e, which is also called the population regression line; e was simulated from a normal distribution with mean zero and standard deviation 0.5. The data are randomly split into two equal parts, each containing 50 observations. Regression models of various polynomial degrees are fitted on the training sample, and their performance is evaluated on the validation sample using the MSE as the measure of the validation set error. The program above selects the best-fitting polynomial based on the average validation set MSE.
Leave One Out Cross Validation
Suppose that we have the dataset (x_i, y_i), i = 1, · · · , n. In this method, the set of observations is divided into two parts; the test set contains only one observation (say (x_1, y_1)) and the training set contains the data (x_i, y_i), i = 2, · · · , n. The model is fitted on the training set, and the test MSE is computed by predicting the response y_1 in the test set; call it MSE_1. The process is continued for (x_2, y_2), ..., (x_n, y_n), and MSE_2, ..., MSE_n are computed respectively. The average test MSE is computed as
MSE =(MSE_1+MSE_2+ · · · +MSE_n)/n.
In the following R codes we demonstrate the method on the simulated data. The technique can be implemented for various statistical learning methods and the method with the smallest LOOCV test error is selected as the best choice.
par(mfrow=c(1,1), mar= c(5, 4, 4, 2) + 0.1)
deg = 10
run = 10
tmp.train.mse = rep(NA, deg)
validation.mse = rep(NA, deg)
tmp.mse = rep(NA, nrow(data))
for (j in 1:deg) {
for (k in 1:nrow(data)) {
train_ = data[-k,]
validation_ = data[k,]
fit = lm(y ~ poly(x, j), data = train_)
test.y = predict.lm(fit, newdata = validation_, interval = "prediction")
tmp.mse[k] = (test.y[,1] - validation_$y)^2
}
validation.mse[j] = mean(tmp.mse)
}
plot(1:deg, validation.mse, type = "b", col = "red", lwd=2,
xlab="Degree of Polynomial", ylab = "Mean Squared Error")
K-fold cross validation
The sample is divided into k non-overlapping, equal-size sets (folds). For each of the k folds, we use the remaining k-1 folds for training and the held-out fold for testing. The mean squared error, MSE_1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as the validation set. This process results in k estimates of the test error, MSE_1, MSE_2, . . . , MSE_k. The k-fold CV estimate of the test MSE is computed by averaging these values:
MSE =(MSE_1+MSE_2+ · · · +MSE_k)/k
LOOCV is a special case of k-fold cross validation, where k is chosen to be equal to the number of observations.
LOOCV is time consuming due to the number of model fits required when the dataset is large.
With k-fold cross-validation, the computation time is reduced because fewer model fits are performed, while all the observations are still eventually used to both train and test the model.
set.seed(1)
x = rnorm(100)
y = x -2 *x^2+ rnorm(100)
data = data.frame(x,y)
deg = 10; run = 10
mse=rep(0,deg)
for(i in 1:run){
fold = sample(rep(1:5,20))
for(j in 1:deg){
yhat = rep(0,100)
for (k in 1:5){
train = (1:100)[fold!=k]
lm.fit = lm(y~poly(x,j),data = data, subset=train)
yhat[-train] = predict(lm.fit, data)[-train]
}
mse[j] = mean((yhat-y)^2)
}
if(i == 1)
plot(1:deg, mse, type = "l", col = i, lwd=2,
xlab="Degree of Polynomial", ylab = "Mean Squared Error")
else
lines(1:deg, mse, col=i, lwd=2)
}
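The same k-fold computation is also available through cv.glm() in the boot package. The following is a minimal sketch (assuming the data frame data from above), not a drop-in replacement for the loop:
library(boot) # provides cv.glm()
cv.err = rep(0, 10)
for(j in 1:10){
glm.fit = glm(y ~ poly(x, j), data = data) # a gaussian glm is equivalent to lm
cv.err[j] = cv.glm(data, glm.fit, K = 5)$delta[1] # 5-fold CV estimate of the test MSE
}
plot(1:10, cv.err, type = "b", xlab = "Degree of Polynomial", ylab = "5-fold CV error")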
Exercise-8. Model flexibility and predictive ability. You are asked to understand the following code and interpret the output generated by it.
library(ISLR)
set.seed(runif(1, 1, 100))
x = runif(n=30, min = 0, max = 1)
y = 1 + 2 * x^3 + rnorm(n=30, mean = 0, sd = 0.5)
data = data.frame(x,y)
deg = 9
run = 100
train.mse = rep(NA, deg)
test.mse = rep(NA, deg)
tmp.train.mse = rep(NA, run)
tmp.test.mse = rep(NA, run)
par(mfrow=c(3,3), mai=c(0.4,0.4,0.4,0.4))
for(i in 1:deg){
plot(x, y, col="red", main="", lwd=2)
for(j in 1:run){
train = sample(1:nrow(data), size = floor(0.7*nrow(data)))
train_ = data[train,]
test_ = data[-train,]
fit = lm(y ~ poly(x, i), data = train_)
tmp.train.mse[j] = sum(fit$residuals^2)/nrow(train_)
test.y = predict.lm(fit, newdata = test_, interval = "prediction")
tmp.test.mse[j] = mean((test.y[,1] - test_$y)^2)
}
train.mse[i] = mean(tmp.train.mse)
test.mse[i] = mean(tmp.test.mse)
train.mse[i]
test.mse[i]
fit.all = lm(y ~ poly(x,i), data = data)
tmp = seq(from = 0, to = 1, by = .01)
tmp.fit = predict(fit.all, newdata = data.frame(x=tmp), interval = "prediction")
lines(tmp, tmp.fit[,1], lwd= 2)
text(.2,max(y)-.5, paste("deg =", i), bty="n", lwd=2, cex=1.4) # label each panel with the polynomial degree
}
par(mfrow=c(1,1), mar= c(5, 4, 4, 2) + 0.1)
ylim = max(train.mse,test.mse)
plot(1:deg, train.mse, type="o", col="red", cex.lab=1.5,
xlab= "Degree of polynomial", ylim=c(0,ylim), lwd=2, ylab="mse")
lines(1:deg, test.mse, type = "o", col = "green", lwd=2)
legend("topleft", col=c("red","green"), cex=1.3,legend=c("train mse", "test mse"), bty="n", lwd=2)
Exercise-16. Read about the packages alr3 and alr4. This will be very helpful while discussing nonlinear regression.
Exercise-17. Solve the exponential, logistic, theta logistic, gompertz and Allee model analytically and sketch the solution curve in R without using the function ode from deSolve package.
Nonlinear Regression
We have already discussed some aspects of linear regression using R, particularly the use of the function lm in R. This function is useful for linear models, or for nonlinear models which can be converted to a linear model. However, there are many models that cannot be written as a linear function of the parameters. In such cases, nonlinear regression modeling is required. In R the nls function takes care of the estimation of parameters for a nonlinear function. Since we have already learned about the different growth curve models, I shall use these models as nonlinear functions whose parameters are to be estimated from given data. To do this, we shall write functions for all the models. I am listing the codes, which were mainly written by the M.Sc students. The idea is that, for each growth model X_t = f(t), we shall simulate data at different time points. Then we fit the model to the simulated data and obtain the parameter estimates using the nls function. We shall compare the true curve and the estimated curve by means of graphs. Note that we have already solved the growth equations in Exercise-17 to obtain X_t as a function of t. Run the following codes; the idea will become clear.
Exponential Model
x0 = 5; r=0.6;
t = seq(0, 5, by = 0.1)
x = x0*exp(r*t) + rnorm(length(t), 0, 9)
data = data.frame(t, x)
plot(t, x, type = "p", xlab = "time", ylab = "Size")
legend('topleft', legend = "Exponential Model", bty="n")
curve(x0*exp(r*x), col = 4, lwd= 2, add = TRUE)
ExpModel = function(t, x0, r){
x0*exp(r*t)
}
fit = nls(x ~ ExpModel(t, x0, r), data = data, start = list(x0 = 6, r = 0.7))
summary(fit)
lines(t, fitted.values(fit), col = 2, lwd = 2)
legend("bottomright", legend = c("Estimated", "True"), col = c(2, 4), lwd = c(2,2), bty = "n")
Logistic Model
LogisticModel = function(t, x0, r, K){
K/(1+(K/x0 - 1)*exp(-r*t))
}
x0 = 5; r = 0.8; K = 20;
t = seq(0, 15, by = 0.3)
x = LogisticModel(t, x0 = 5, r = 0.8, K = 20) + rnorm(length(t), 0, 2)
data = data.frame(t, x)
plot(t, x, type = "p", xlab = "time", ylab = "Size")
legend('topleft', legend="Logistic Model", bty="n")
curve(K/(1+(K/x0 - 1)*exp(-r*x)), lwd = 2, col = 2, add = T)
fit = nls(x ~ LogisticModel(t, x0, r, K), data = data, start = list(x0 = 6, r = 1, K = 20) )
summary(fit)
lines(t, fitted.values(fit), col = 3, lwd = 3)
legend("bottomright", legend = c("Estimated", "True"), col = c(3, 2), lwd = c(2,2), bty = "n")
library(car) # deltaMethod() is from the car package
deltaMethod(fit, "1/r*log(K/x0-1)") # Estimate and standard error of a function of the parameters using the delta method
theta-Logistic Model
ThetaLogisticModel = function(t, r, x0, K, theta){
K/(1+exp(-theta*r*t)*((K/x0)^(theta)-1))^(1/theta)
}
x0 = 5; r = 0.4; K = 20; theta = 0.7;
t = seq(0, 15, by = 0.3)
x = ThetaLogisticModel(t, r = 0.4, x0 = 5, K = 20, theta = 0.7) + rnorm(length(t), 0, 1)
data = data.frame(t, x)
plot(t, x, type = "p", xlab = "time", ylab = "Size")
legend('topleft', legend="theta-Logistic Model", bty="n")
curve(ThetaLogisticModel(x, r, x0, K, theta ), lwd = 2, col = 3, add = TRUE)
fit = nls(x ~ ThetaLogisticModel(t, r, x0, K, theta), data = data,
start = list(x0 = 5, r = 0.4, K = 20, theta = 0.7))
lines(t, fitted.values(fit), col = 2, lwd = 2)
legend("bottomright", legend = c("Estimated", "True"), col = c(2, 3), lwd = c(2,2), bty = "n")
Gompertz Model
GompertzModel = function(t, x0, r, alpha){
x0*exp(r*(1-exp(-alpha*t))/alpha)
}
x0 = 5; alpha = 0.4; r = 0.5
t = seq(0, 15, by = 0.3)
x = GompertzModel(t, r = 0.5, x0 = 5, alpha = 0.4) + rnorm(length(t), 0, 1)
data = data.frame(t, x)
plot(t, x, type = "p", xlab = "time", ylab = "Size")
legend('topleft', legend = "Gompertz Model", bty = "n")
curve((x0*exp(r*(1-exp(-alpha*x))/alpha)), lwd = 2, col = 3, add = T)
fit = nls(x ~ GompertzModel(t, x0, r, alpha), data = data, start = list(x0 = 5, r = 0.5, alpha = 0.4))
lines(t, fitted.values(fit), col = 2, lwd = 2)
legend("bottomright", legend = c("Estimated", "True"), col = c(2, 3), lwd = c(2,2), bty = "n")
Bootstrapping nonlinear regression model
Note that we have already computed the bootstrap estimates of the parameters of the linear regression model. The same thing can be done for the nonlinear regression model also. First we will show this by explicitly writing the code, and then we shall see the use of special packages in R to obtain the bootstrap distribution of the parameters of nonlinear models. We take the logistic model as a case study. It can easily be modified for other models.
LogisticModel = function(t, x0, r, K){
K/(1+(K/x0 - 1)*exp(-r*t))
}
control = nls.control(maxiter = 500)
x0 = 5; K = 20; r = 0.8
t = seq(0, 15, by = 0.3)
x = LogisticModel(t, x0 = 5, r = 0.8, K = 20) + rnorm(length(t), 0, 2)
data = data.frame(t, x)
B = 500
boot.x0 = numeric(B); boot.r = numeric(B); boot.K = numeric(B)
for(i in 1:B){
brow = sample(1:nrow(data), replace = T)
newdata = data[brow,]
fit = nls(x~LogisticModel(t, x0, r, K), control=control,
data = newdata, start = list(x0 = 5, K = 20, r = 0.8))
boot.x0[i] = coef(fit)[1]
boot.K[i] = coef(fit)[2]
boot.r[i] = coef(fit)[3]
}
par(mfrow=c(3,3))
hist(boot.x0, probability = T, col = "grey", main = "histogram of x0")
plot(boot.x0, boot.K, type = "p", col = "red")
plot(boot.x0, boot.r, type = "p", col = "red")
plot(boot.K, boot.x0, type = "p", col = "red")
hist(boot.K, probability = T, col="grey", main = "histogram of K")
plot(boot.K, boot.r, type = "p", col="red")
plot(boot.r, boot.x0, type = "p", col="red")
plot(boot.r, boot.K, type = "p", col="red")
hist(boot.r, probability = T, col="grey", main = "histogram of r")
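One convenient special package for this (an option, not the only one) is nlstools, whose nlsBoot() function bootstraps the centred residuals of a fitted nls model rather than resampling data rows. A minimal sketch, assuming nlstools is installed:
library(nlstools) # provides nlsBoot()
fit0 = nls(x ~ LogisticModel(t, x0, r, K), data = data, start = list(x0 = 5, K = 20, r = 0.8))
boot.fit = nlsBoot(fit0, niter = 500) # 500 bootstrap resamples of the residuals
summary(boot.fit) # bootstrap estimates and confidence intervals
plot(boot.fit) # pairwise plots of the bootstrap distributions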
Note: nls.control() can be used to control the estimation algorithm used by nls function. Type nls.control() in the console and check.
Note: Available methods for extracting various quantities from an nls() model fit.
a) anova Comparison of several nested model fits
b) coef Parameter estimates from model fit
c) confint Profile likelihood based confidence intervals from MASS package
d) fitted The prediction based on the fitted model on the used data set
e) residuals Residuals from the model fit
f) summary Summary of the output
g) vcov Variance-Covariance matrix of the parameter estimates.
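A quick demonstration of some of these extractors on any nls object, for instance the logistic fit above (confint() profiles the likelihood, so it may take a moment):
coef(fit) # parameter estimates
vcov(fit) # variance-covariance matrix of the estimates
confint(fit) # profile likelihood confidence intervals (uses MASS, as noted above)
head(fitted(fit)); head(residuals(fit)) # fitted values and residuals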
Real Data Analysis
Data were collected on the length of fish, Cirrhinus mrigala, at 12 consecutive time points for each of four equispaced directions, referred to as A, B, C and D, emanating at 45° from the center of the lake towards its four corners. At each time point, 12 measurements were available. Fish were captured in hoop nets placed at an equal radial distance from the center in each of these directions.
setwd("Folder Path") #Set working directory.
#LOCATION A
data.A = read.table("LocationA.txt", header = FALSE, sep = "\t", na.strings = "NA")
time = 1:12
class(data.A)
dim(data.A)
matplot(time, t(data.A), type = "p", col = 1, ylab = "Observed length (cm)", xlab = "Time (week)", main = "Location A")
x = colMeans(data.A)
t = 1:12
data = data.frame(t, x)
fit = nls(x ~ ExpModel(t, x0, r), data = data, start = list(x0 = x[1], r = 0.7))
summary(fit)
lines(t, fitted.values(fit), col = "orange", lwd = 2)
resid(fit)
mse = mean(resid(fit)^2)
fit1 = nls(x ~ LogisticModel(t, x0, r, K), data = data, start = list(x0 = x[1], r = 0.7, K = x[12]) )
summary(fit1)
lines(t, fitted.values(fit1), col = "red", lwd = 2)
resid(fit1)
mse1 = mean(resid(fit1)^2)
mse1
fit2 = nls(x ~ ThetaLogisticModel(t, r, x0, K, theta), data = data, start = list(x0 = x[1], r = 0.4, K = x[12], theta = 1))
summary(fit2)
lines(t, fitted.values(fit2), col = "blue", lwd = 2)
resid(fit2)
mse2 = mean(resid(fit2)^2)
fit3 = nls(x~GompertzModel(t, x0, r, alpha), data = data, start = list(x0 = x[1], r = 0.5, alpha = 0.4))
lines(t, fitted.values(fit3), col = "green", lwd = 2)
resid(fit3)
mse3 = mean(resid(fit3)^2)
legend("topleft", legend = c("Exponential", "Logistic", "Theta-logistic", "Gompertz"), col = c("orange","red", "blue", "green"), lwd = c(2,2,2,2), bty = "n")
mse
mse1
mse2
mse3
#If MSE is considered as the measure of goodness of fit, then the theta-logistic model is the best model describing the data.
AIC(fit) #Akaike Information Criterion
AIC(fit1)
AIC(fit2)
AIC(fit3)
#A model with lower AIC is preferred to a model with higher AIC.
#If AIC is considered as the measure of goodness of fit, then the theta-logistic model is again the best model describing the data.
summary(fit2) #Summary of the best model.
More conceptual problems:
In some of the simulation exercises above, we have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. Explore this fact using the following simulated data. The problem is from James et al. (2013), Chapter 6. A sketch of the data-generation and subset-selection steps is given after this list.
Generate a data set with p = 20 features, n = 1000 observations, and an associated quantitative response vector generated according to the model $Y = X\beta + \epsilon$, where $\beta$ has some elements that are exactly equal to zero.
Split your data set into a training set containing 100 observations and a test set containing 900 observations.
Perform best subset selection on the training set, and plot the training MSE associated with the best model of each size.
Plot the test set MSE associated with the best model of each size.
For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way you are generating the data until you come up with a scenario in which the test MSE is minimized for an intermediate model size.
How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.
Create a plot displaying $\sqrt{\sum_{j=1}^p (\beta_j-\hat{\beta}_j^r)^2}$ for a range of values of $r$, where $\hat{\beta}_j^r$ is the jth coefficient estimate for the best model containing $r$ coefficients. Comment on what you observe. How does this compare to the test MSE plot above?
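A possible starting point for the data-generation and subset-selection steps (every specific choice here, such as the beta values, is arbitrary):
set.seed(1)
n = 1000; p = 20
X = matrix(rnorm(n*p), nrow = n)
beta = c(rep(2, 5), rep(0, 10), rep(-1, 5)) # some coefficients exactly zero
y = drop(X %*% beta) + rnorm(n)
d = data.frame(y = y, X)
train = sample(1:n, size = 100) # 100 train / 900 test split
library(leaps)
fit = regsubsets(y ~ ., data = d[train,], nvmax = p) # best subset selection on the training set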
More practice with R:
You are expected to run the following codes. Some nice graphics are also included.
Auto = read.table("http://www-bcf.usc.edu/~gareth/ISL/Auto.data", head=T)
library(ISLR)
data(Auto)
names(Auto)
head(Auto)[,1:4]
dim(Auto)
tail(Auto)[,1:4]
class(Auto)
Auto = na.omit(Auto) # na.omit() simply removes the rows which have missing values
dim(Auto)
summary(Auto) # to get the descriptive summary for each of the variables in the Auto dataset
plot(Auto$cylinders , Auto$mpg )
attach(Auto) # See the help file for attach() function
plot(cylinders, mpg)
plot(cylinders, mpg, main="cylinders and mpg", xlab="number of cylinders", ylab="miles per gallon", col="red", lwd=4, cex.lab=2, cex.main = 3 )
The cylinders variable is stored as a numeric vector, so R has treated it as quantitative. However, there are only a small number of possible values for cylinders, so one may prefer to treat it as a qualitative variable.
cylinders = as.factor(cylinders)
class(cylinders)
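As a quick aside, a sketch with made-up data showing how integer codes can be turned into labeled factors:
status = c(0, 1, 1, 0, 1) # hypothetical 0/1 coding: 0 = Married, 1 = Unmarried
status = factor(status, levels = c(0, 1), labels = c("Married", "Unmarried"))
table(status)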
In real data, categorical variables are often stored as codes such as 0 and 1 (Married = 0, Unmarried = 1; High = 2, Medium = 1, Low = 0). The as.factor() function can be useful in these cases. Let us plot again. - If the variable plotted on the x-axis is categorical, boxplots will automatically be produced by the plot() function. As usual, a number of options can be specified in order to customize the plots.
plot(cylinders, mpg)
plot(cylinders , mpg , col ="red ")
plot(cylinders, mpg ,col ="grey" , varwidth =FALSE, horizontal =T)
plot(cylinders , mpg , col ="green", varwidth =T)
plot(cylinders, mpg, col =c("red", "green","blue","grey","yellow"), varwidth =F, xlab=" cylinders ", ylab ="MPG ")
boxplot(mpg~cylinders, data=Auto, col =2:6, xlab="cylinders", ylab ="MPG")
boxplot(mpg~cylinders, data=Auto, col =2:6, notch=TRUE, xlab="cylinders", ylab ="MPG ")
boxplot(mpg, data=Auto, horizontal=TRUE, lwd=3, cex.axis=2, main="box plot of mpg", cex.main=2); box(lwd=3)
text(x=rep(c(13,20,26,37),2), y=c(rep(1.4,4),rep(.7,4)), labels=c(rep("25%",4),1:4), cex=2)
abline(v=quantile(mpg), lty=2, col=2:6, lwd=2.5)
Recall that cylinders is now a categorical variable. We want to draw a barplot of the variable cylinders; table(cylinders) computes the frequency of each class.
freq = table(cylinders)
barplot(freq, cex.axis = 2, cex.lab = 2, beside=T)
barplot(freq, col=2:6, main="barplot of cylinders")
hist(mpg, col = 2)
hist(mpg,col=3,prob=T)
The pairs(Auto) function creates a scatterplot matrix, i.e. a scatterplot for every pair of variables. A subset of the variables can also be plotted.
pairs(Auto)
pairs(~mpg + displacement + weight, Auto, col=2)
library(car)
scatterplotMatrix(~mpg+displacement + weight, data=Auto, diagonal = "histogram")
scatterplotMatrix(~mpg+displacement + weight, data=Auto, diagonal = "density")
scatterplotMatrix(~mpg+displacement + weight, data=Auto, diagonal = "boxplot")
help(scatterplotMatrix)
library(UsingR)
norm = rnorm(200)
simple.hist.and.boxplot(norm, main="symmetric")
expo = rexp(200, rate=5)
simple.hist.and.boxplot(expo, main="long tailed")
unif = runif(200, min=-5,max=5)
simple.hist.and.boxplot(unif, main="short tailed")
logis = rlogis(200, 0, scale = 1)
simple.hist.and.boxplot(logis, main="both tail long")
par(mfrow=c(1,2))
simple.violinplot(mpg~ cylinders, data = Auto, col="lightgrey")
box(lwd=3)
simple.densityplot(mpg~ cylinders, data=Auto, col=2, lwd=2)
box(lwd=3)
plot(horsepower, mpg)
identify(horsepower, mpg, name) # Interactive plot in R. Choosing points by name
cyl.freq = table(cylinders)
pie(cyl.freq)
names = c("cylinder = 8 \n", "cylinder = 4 \n", "cylinder = 6 \n", "cylinder = 3 \n", "cylinder = 5 \n")
colors = c("red","green","blue", "yellow", "grey")
pie(cyl.freq, col=colors, labels =names)
pct = round(cyl.freq/sum(cyl.freq)*100)
pct = paste("(", pct, "%",")", sep=" ")
pctnames = paste(names, pct )
pie(cyl.freq, col=rainbow(length(cyl.freq)), labels =pctnames)
library(plotrix)
pie3D(cyl.freq,labels=pctnames,explode=0.1, main="Pie Chart of cylinders ", labelrad = 1.6)
library(lattice)
p1 = histogram(~mpg | factor(cylinders), data=Auto)
print(p1)
# Surface plots
x = seq(from = -2, to= 2, by = 0.1)
y = seq(from = -1, to = 1, by = 0.1)
model = function(x,y){
3+x+y
}
z=outer(x,y,model)
persp(x,y,z, theta=50, col="red")
model = function(x, y){
3+3*x^2+.1*y^3
}
z = outer(x, y, model)
persp(x,y, z, theta=50, ticktype="detailed", col="red")
persp(volcano,col="grey")
contour(volcano)
library(vioplot)
vioplot(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width, col="grey")
library(lattice)
wireframe(volcano, shade = TRUE, aspect = c(61/87, 0.4), light.source =c(10,0,10))
library(lattice)
cloud(Sepal.Length ~ Petal.Length * Petal.Width | Species, data = iris, screen =list(x = -90, y = 70), distance = .4, zoom = .6)
cloud(Sepal.Length ~ Petal.Length * Petal.Width, data = iris, screen = list(x =-90, y = 70), distance = .4, zoom = .6)
library(playwith)
library(zoo)
playwith(xyplot(sunspots ~ yearmon(time(sunspots)), xlim = c(1900, 1930), type ="l"), time.mode = TRUE)
Three sessions (4 hours each) on R Programming during Software Lab-II (Semester-III, 2017):
setwd("C:/Users/admin/Desktop/RLecture")
getwd()
#Simulating random numbers following some distribution
#Generating uniform random numbers
set.seed(1) #Generate same set of random numbers
x = runif(n = 100, min=0, max = 1)
print(x)
x = runif(100); print(x)
x = runif(100, 1, 10); print(x)
#Some important functions
range(x)
max(x)
min(x)
mean(x)
sd(x)
is.na(x)
sum(is.na(x))
summary(x)
#Histogram
set.seed(1)
x = runif(10000, min = 1, max = 5)
hist(x, probability = T)
abline(h = 1/(5-1), lwd =2, col = "red")
#lwd is used to fix the width of the line
#Multiple plots in one plotting window
par(mfrow = c(2,2))
n = c(10, 100, 1000, 10000)
for(i in 1:length(n)){
set.seed(1)
x = runif(n[i], min = 1, max = 5)
print(summary(x))
hist(x, probability = T)
abline(h = 1/(5-1), lwd =2, col = "red")
}
#Generating random numbers from exponential distribution
par(mfrow = c(1,1))
lambda = 2
curve(lambda*exp(-lambda * x), from = 0, to=5 )
lambda = 1
curve(lambda*exp(-lambda * x), from = 0, to=5, ylab = expression(lambda *exp(-lambda*x)),
lwd =2, col = "red", ylim=c(0,2), main = "Exponential density function")
lambda = 2
curve(lambda*exp(-lambda * x), from = 0, to=5, ylab = expression(lambda *exp(-lambda*x)),
lwd =2, col = "blue", add = T)
lambda = 3
curve(lambda*exp(-lambda * x), from = 0, to=5, ylab = expression(lambda *exp(-lambda*x)),
lwd =2, col = "green", add = T)
legend = c(expression(paste(lambda," = 1")), expression(paste(lambda," = 2")), expression(paste(lambda," = 3")))
legend("topright", legend= legend, bty="n", lwd = c(2,2,2), col = c("red","blue", "green"))
# Checking for density function
lambda = 2
f = function(x){
lambda*exp(-lambda*x)
}
I = integrate(f, lower = 0, upper = Inf)
print(I)
str(I) # What are information available in I
print(I$value)
# Expected value
fx = function(x){
x*f(x)
}
I1 = integrate(fx, lower = 0, upper = Inf)
print(I1)
fx2 = function(x){
x^2*f(x)
}
I2 = integrate(fx2, lower = 0, upper = Inf)
print(I2)
fvar = I2$value - I1$value^2
print(fvar)
set.seed(1)
x = rexp(100)
print(x)
hist(x, probability = T)
lambda = 1
curve(lambda*exp(-lambda * x), from = 0, to=5, ylab = expression(lambda *exp(-lambda*x)), lwd =2, col = "red", add =T)
#Generating random numbers from normal distribution
mu = 0
sigma = 1
f = function(x){
(1/sqrt(2*pi*sigma^2))*exp(-(x-mu)^2/(2*sigma^2))
}
I = integrate(f, -Inf, Inf)
print(I)
curve(f(x), -4, 4, lwd = 2, col = 2, main = expression(frac(1,sqrt(2*pi*sigma^2))*e^(-frac((x-mu)^2,2*sigma^2))))
legend(1.5, 0.4,legend=expression(paste(mu,"=0, ", sigma, "=1")), bty = "n", col = 2, lwd = 2)
set.seed(1)
x = rnorm(10000)
hist(x, probability = T, xlim = c(-3,3), ylim = c(0, 0.5))
curve(f(x), -3, 3, add = T, col = "blue")
mean(x)
sd(x)
#--------------------------------------------------------------------------------------
Central limit theorem.
#--------------------------------------------------------------------------------------
x1 = rexp(1000) # Generate 1000 random numbers from exp(1) distribution
x2 = rexp(1000)
x3 = rexp(1000)
x4 = rexp(1000)
hist(x1, prob = T) # Histograms of x2, x3, ... will be similar, hence omitted
par(mfrow = c(2,2)) # Multiple plots in a single plot window
y1 = x1
hist(y1, prob = T) # Histogram of Y_1
y2 = (x1 + x2)/2
hist(y2, prob = T) # Histogram of Y_2
y3 = (x1+x2+x3)/3
hist(y3, prob = T) # Histogram of Y_3
y4 = (x1+x2+x3+x4)/4
hist(y4, prob = T) # Histogram of Y_4
Function_CLT = function(n){
set.seed(1)
x = matrix(NA, nrow = n, ncol = 100)
for(i in 1:n){
x[i,] = rexp(100)
}
y = matrix(NA, nrow = n, ncol = 100)
for(i in 1:n){
for(j in 1:100){
y[i,j] = sum(x[1:i,j])/i
}
}
par(mfrow = c(3,3))
for(i in 1:n)
hist(y[i,], prob = T, col= "lightgrey", main = paste("n=",i), xlab = expression(paste(Y)))
}
Function_CLT(9)
# Extend the above function to include the user input for the choice of distribution of X1, dist =
# "uniform", "binomial", "poisson", "exponential".
# Simulating the data from the normal distribution. You must play with the functions rnorm(), runif(), rexp() etc.
#Normal distribution
x = rnorm(100)
hist(x, prob = T)
# Distribution function of normal random variable
y = seq(-4, 4, length.out = 1000)
pnorm(0)
z = pnorm(y)
z
plot(y, z, ylab = expression(P(X<=y)))
# Density function of normal random variable
y = seq(-4, 4, length.out = 1000)
dnorm(0)
z = dnorm(y)
z
plot(y, z, ylab = expression(f[X](x)), xlab = expression(x))
# Overlay density function and histogram
x = rnorm(100)
hist(x, prob = T)
lines(sort(x), dnorm(sort(x)), col = "red", lwd = 2)
# Quantile function
qnorm(0.5)
# Demonstration for exponential distribution
x = rexp(100)
hist(x, prob = T)
y = seq(0,5, length.out = 100)
lines(y, dexp(y), col = "red", lwd = 2)
plot(y, pexp(y), main = "Distribution function of X~exp(1) ")
# Simple linear Regression
b0 = 1; b1 = 2 # Fix the population regression coefficients
x = seq(1, 10, length.out = 30) # Input variables or the explanatory variable
y = b0 + b1* x + rnorm(length(x)) # Generate data values for the response variable
plot(x, y, col = "red") # The scatter plot
data = data.frame(x = x, y = y)
summary(data)
fit = lm(y ~ x, data = data)
summary(fit)
abline(fit, lwd =2, col = "blue")
legend("topleft",legend=expression(paste("y=", 1.19 + 1.95*x)), bty = "n")
par(mfrow = c(2,2))
plot(fit)
residuals = y - fitted.values(fit)
plot(fitted.values(fit), residuals, xlab = "fitted values")
abline(h =0, col = "red", lwd = 2)
hist(residuals, prob = T)
std_e = as.numeric(scale(residuals))
hist(std_e,prob = T)
str(fit)
fit$coefficients
fit$residuals
fit$fitted.values
#-----------------------------------------------------------------------------------------
Regression with more than one explanatory variable. The function lm() also allows the use of multiple input/explanatory variables.
#-----------------------------------------------------------------------------------------
library(MASS)
data("Boston")
View(Boston)
fix(Boston)
edit(Boston)
help(Boston)
class(Boston)
names(Boston)
head(Boston, n =10)
tail(Boston)
summary(Boston)
dim(Boston)
sum(is.na(Boston))
pairs(Boston[,1:6])
fit = lm(medv ~ dis + nox + crim, data = Boston)
summary(fit)
#----------------------------------------------------------------------------------------------
In the following section, we shall discuss the fitting of a nonlinear mathematical function to the data. Some equations may be converted to a linear form, after which the linear regression model can be applied.
#------------------------------------------------------------------------------------------------
set.seed(12345)
x = seq(0, 5, length.out = 50)
b0 = 0.8; b1 = 0.1 # Fix the value of the population parameters
y = b0*exp(b1*sqrt(x) + rnorm(length(x), mean = 0, sd = 0.01)) # Generate the response variable
plot(x, y, lwd = 2, col = "red")
d = data.frame(x=x, y= y )
#--------------------------------------------------------------------------------
Many nonlinear functions can be converted to a linear function by taking some suitable transformation of the regressor and the response variables. Let us take the following transformation.
#-------------------------------------------------------------------------------
Y = log(y) # Transformed response variable
X = sqrt(x) # Transformed regressor variable
data = data.frame(X, Y)
fit = lm(Y~X, data = data) # Fit simple linear regression to the transformed data
summary(fit) # Look at the summary of the fit
B0 = coef(fit)[1] # Estimate of the parameters for the transformed model (linear)
B1 = coef(fit)[2]
plot(X, Y, col = "red", lwd = 2)
abline(fit, col = "blue", lwd =2)
# Let us now get back to the nonlinear equation and see the fitting
b0_hat = exp(B0); #Retrieve the estimates of the parameters of the nonlinear model
print(b0_hat)
b1_hat = B1;
print(b1_hat)
# Let us now plot the nonlinear equation to the scatter plot of the original variables
main = expression(paste(y, " = ",b[0]*e^(b[1]*sqrt(x)) ))
plot(x, y, lwd = 2, col = "red", main = main)
curve(b0_hat*exp(b1_hat*sqrt(x)), 0, 5, lwd = 2, col = "blue", add = TRUE)
#-------------------------------------------------------------------------------------
The nonlinear regression function may not be convertible to a linear regression problem by a suitable mathematical transformation. In that case we have to use a numerical routine to estimate the parameters of the nonlinear model. This can be done in R using the nls() function. By default, the estimates are obtained using a Gauss-Newton algorithm. Refer to the discussion in class for the score equation and the Hessian matrix.
#-----------------------------------------------------------------------------------
fit = nls(formula = y~ b0*exp(b1*sqrt(x)), data = d, start = list(b0 = 0.7, b1 = 0.13))
summary(fit)
yhat = fitted.values(fit)
plot(x,y, lwd = 2, col = "red")
lines(x, yhat, col = "blue", lwd = 2)
# Investigate the residuals
plot(yhat, residuals(fit), lwd = 2)
e = residuals(fit)
e_std = (e-mean(e))/sd(e)
qqnorm(e_std)
qqline(e_std, col ="red", lwd = 2)
#-------------------------------------------------------------------------------------
# nls() allows the user to deal with heteroscedasticity, i.e. the situation where the
# variance of the errors is not the same at all input levels. In nls(), try changing the option weights =
# Refer to the discussion in the class for choosing the weights appropriately
#--------------------------------------------------------------------------------------
fit = nls(formula = y~ b0*exp(b1*sqrt(x)), data = d, start = list(b0 = 0.7, b1 = 0.13), weights = x^(1/5))
plot(fitted.values(fit), residuals(fit))
#----------------------------------------------------------
# Exercise 8, ISLR, CHAPTER-II
#----------------------------------------------------------
setwd("C:/Users/admin/Desktop/RLecture07102017")
college = read.csv("College.csv")
names(college)
univ.names = college$X
print(univ.names)
college = college[,-1]
row.names(college) = univ.names
View(college)
summary(college)
pairs(college[,1:5])
# boxplot
head(college[,c("Private", "Outstate")])
attach(college)
plot(Private, Outstate, lwd = 2, col = "lightgrey",
main = "Boxplot of Outstate versus Private")
Elite = rep("No", nrow(college))
Elite[college$Top10perc > 50] = "Yes"
college = data.frame(college, Elite = as.factor(Elite)) # make Elite a factor (R >= 4.0 no longer converts strings automatically)
names(college)
table(Elite)
plot(college$Elite, Outstate, col = "red", lwd =2, main = "Boxplot of Outstate versus Elite")
par(mfrow= c(2,2))
hist(Outstate, prob = T, col = "lightgrey")
hist(Grad.Rate, prob = T, col = "lightyellow")
hist(S.F.Ratio, prob = T, col = "green")
hist(Books, prob = T, col = "blue")
library(ISLR)
data("Auto")
dim(Auto)
names(Auto)
complete.cases(Auto)
sum(complete.cases(Auto))
x = numeric(length = length(names(Auto))); length(x)
x = numeric(dim(Auto)[2]); length(x)
for(i in 1:length(x)){
x[i] = ifelse(is.numeric(Auto[,i]), 1, 0)
}
sum(x)
sapply(Auto, FUN = is.numeric)
# Range of the quantitative variables
which(x==0)
sapply(Auto[,-c(which(x==0))], FUN = range)
sapply(Auto[,-c(which(x==0))], FUN = mean)
sapply(Auto[,-c(which(x==0))], FUN = sd)
newAuto = Auto[-c(10:85),]
sapply(newAuto[,-c(which(x==0))], FUN = range)
sapply(newAuto[,-c(which(x==0))], FUN = mean)
sapply(newAuto[,-c(which(x==0))], FUN = sd)
table(Auto$cylinders)
table(Auto$origin)
# We treat them as factor variables because the number of unique values is 3 for origin and 5 for cylinders.
Auto$cylinders = as.factor(Auto$cylinders)
Auto$origin = as.factor(Auto$origin)
# Following boxplots may be informative
plot(Auto$cylinders, Auto$mpg, col = "lightgrey", pch =20, cex=1, ylab = "mpg", xlab = "Cylinders")
plot(Auto$cylinders, Auto$displacement, col = "lightgrey", pch =20, ylab = "displacement", xlab = "Cylinders")
plot(Auto$cylinders, Auto$horsepower, col = "lightgrey", pch =20, ylab = "horsepower", xlab = "Cylinders")
plot(Auto$cylinders, Auto$weight, col = "lightgrey", pch =20, ylab = "weight", xlab = "Cylinders")
plot(Auto$origin, Auto$mpg, col = "lightgrey", pch =20, cex=1, ylab = "mpg", xlab = "origin")
plot(Auto$origin, Auto$displacement, col = "lightgrey", pch =20, ylab = "displacement", xlab = "origin")
plot(Auto$origin, Auto$horsepower, col = "lightgrey", pch =20, ylab = "horsepower", xlab = "origin")
plot(Auto$origin, Auto$weight, col = "lightgrey", pch =20, ylab = "weight", xlab = "origin")
pairs(Auto)
#Scatterplot suggests that the quantitative variable year does not seem important
pairs(Auto[,c("mpg", "displacement","horsepower","weight", "acceleration")])
# The predictors weight and horsepower seem to be linearly related
# horsepower and displacement seem to be linearly related
# mpg seem to be nonlinearly related to displacement, horsepower and weight
# acceleration does not seem to impact mpg
# There seems to be a relationship between horsepower and acceleration
# There seems to be a relationship between displacement and acceleration
# Weight and acceleration do not seem to impact each other
# Analyzing the correlation matrix visually
cor(Auto[,c("mpg", "displacement","horsepower","weight", "acceleration")])
library(corrplot)
d = Auto[,c("mpg", "displacement","horsepower","weight", "acceleration")]
class(d)
M = cor(as.matrix(d))
print(M)
par(mfrow= c(1,1))
corrplot(M)
corrplot(M, method = c("number"))
corrplot(M, method = "ellipse", type="upper")
help("corrplot")
# We comment that the horsepower and displacement are highly correlated (cor = 0.9)
# We comment that the weight and displacement are highly correlated (cor = 0.93)
# mpg is negatively correlated with displacement (r=-0.81), horsepower(-0.78), weight(-0.83)
# The plots suggest that displacement, horsepower and weight may be used to predict mpg.
fit = lm(mpg~ displacement + horsepower + weight, data = Auto)
summary(fit)
# Let us directly apply the formula from multiple linear regression and tally them with the output of lm() function
n = nrow(Auto)
p = 3
X = cbind(rep(1, n), Auto$displacement, Auto$horsepower, Auto$weight)
head(X)
Y = Auto$mpg
beta_hat = numeric(p+1)
beta_hat = solve(t(X)%*%X)%*%t(X)%*%Y
beta_hat = as.vector(beta_hat)
abs(beta_hat-coef(fit))<10^(-6)
error = Y - X%*%beta_hat
residuals(fit)
plot(error, residuals(fit))
var_error = sum(error^2)/(n-p-1)
vcov_beta_hat = var_error*solve(t(X)%*%X)
vcov_beta_hat
var_beta_hat = diag(vcov_beta_hat)
sqrt(var_beta_hat)
summary(fit)
#----------------------------------------------
###################################
# Section for best subset selection
###################################
setwd("C:/Users/admin/Desktop/RLecture07102017")
getwd()
Boston = read.csv(file = "boston.csv")
names(Boston)
names(Boston) = c("crimeRate","zone","indus","charles","nox","rooms","age",
"distances","highways","tax","teacherRatio","color","status","cost")
length(names(Boston))
dim(Boston)
head(Boston)
summary(Boston)
fit = lm(cost~., data = Boston)
summary(fit)
library(leaps)
fit = regsubsets(cost~., data = Boston, nvmax = 13, method = "forward") # forward stepwise; omit method= for an exhaustive best subset search
summary(fit)
bestfit = summary(fit)
plot(bestfit$rsq, xlab = "Number of predictors", col="grey",
ylab = "R-square",pch=20, lwd=2, type="l", main = "Best fit R-square versus Number of predictors")
ind = which.max(bestfit$rsq)
points(ind, bestfit$rsq[ind], col = "red", lwd=3,
pch=20, cex =2)
plot(bestfit$adjr2, xlab = "Number of predictors", col="grey",
ylab = "Adjusted R-square",pch=20, lwd=2, type="l", main= "Best fit Adj R^2 versus Number of predictors")
ind = which.max(bestfit$adjr2)
points(ind, bestfit$adjr2[ind], col = "red", lwd=3,
pch=20, cex =2)
plot(bestfit$cp, xlab = "Number of predictors", type="l",
ylab = "Mallow's Cp",pch=20, lwd=2, col ="grey",
main = "Best fit Cp versus Number of predictors")
ind = which.min(bestfit$cp)
points(ind, bestfit$cp[ind], col = "red", lwd=3, pch=20, cex =2)
plot(bestfit$rss, xlab = "Number of predictors", ylab ="RSS",
type="l", col ="grey", lwd=2,
main = "Best fit RSS versus Number of predictors")
ind = which.min(bestfit$rss)
points(ind, bestfit$rss[ind], col="red", lwd=2, pch=20)
coef(fit,1)
coef(fit,2)
coef(fit,10)
coef(fit,11)
coef(fit, ind )
# Visual representation
plot(fit, scale = "r2")
plot(fit, scale = "adjr2")
# Polynomial regression and best fitting
x = seq(1, 5, length.out = 100)
y = 2 + 1.5*x^2 + rnorm(length(x), 0, 4.5)
d = data.frame(x,y)
plot(x,y)
# Fitting a polynomial regression with high noise data
fit = lm(y~ x + I(x^2), data = d)
summary(fit)
curve(coef(fit)[1] + coef(fit)[2]*x + coef(fit)[3]*x^2, 0,5,
add= TRUE, col = "red", lwd=2 )
# Fitting a polynomial regression with low noise data
x = seq(1, 5, length.out = 100)
y = 2 + 1.5*x^2 + rnorm(length(x), 0, 2.0)
d = data.frame(x,y)
plot(x,y)
fit = lm(y~ x + I(x^2), data = d)
summary(fit)
curve(coef(fit)[1] + coef(fit)[2]*x + coef(fit)[3]*x^2, 0,5,
add= TRUE, col = "red", lwd=2 )
library(ISLR)
fit = lm(y~poly(x, 2, raw = TRUE), data = d)
summary(fit)
# Train test split
x = seq(1,5, length.out = 100)
y = 2 + 1.5*x^2 + rnorm(length(x), 0, sd = 6.0)
d = data.frame(y,x)
plot(x,y)
train = sort(sample(1:nrow(d), size = ceiling(nrow(d)*0.6), replace = FALSE))
train
train_data = d[train,]
test_data = d[-train,]
deg = 1:10
train_mse = numeric(length = length(deg))
test_mse = numeric(length = length(deg))
for(i in deg){
fit = lm(y~poly(x, i, raw = TRUE), data = train_data)
train_mse[i] = sum((fit$residuals)^2)/length(train)
pred_y = predict(fit, newdata = test_data)
test_mse[i] = sum((test_data$y - pred_y)^2)/nrow(test_data)
}
plot(train_mse, col= "red", type="l", lwd=2, ylab= "MSE",
xlab="degree of polynomial", ylim=c(0,50))
lines(test_mse, col= "blue", lwd=2)
legend("topright", legend =c("train mse", "test mse"),
col=c("red","blue"), lwd=c(2,2), bty="n")
# Repeating the process multiple times
run = 40
deg = 1:10
train_mse = matrix(NA, nrow= run, ncol = length(deg))
test_mse = matrix(NA, nrow= run, ncol = length(deg))
for(i in 1:run){
train = sort(sample(1:nrow(d), size = ceiling(nrow(d)*0.8), replace = FALSE))
train
train_data = d[train,]
test_data = d[-train,]
for(j in deg){
fit = lm(y~poly(x, j, raw = TRUE), data = train_data)
train_mse[i,j] = sum((fit$residuals)^2)/length(train)
pred_y = predict(fit, newdata = test_data)
test_mse[i,j] = sum((test_data$y - pred_y)^2)/nrow(test_data)
}
}
ylim = c(min(train_mse),max(test_mse))
avg_train_mse = colMeans(train_mse)
avg_test_mse = colMeans(test_mse)
plot(avg_train_mse, col = "red", type="l", lwd=2, ylim=ylim)
lines(avg_test_mse, col= "blue", type="l", lwd=2)
legend("topright", legend=c("train mse", "test mse"),
lwd=c(2,2), col = c("red","blue"), bty="n")
ind= which.min(avg_test_mse)
points(ind, avg_test_mse[ind], pch=4, lwd=2, col= "green")
This is an M.Sc lab course entitled "Bioinformatics and Statistics Lab - I". This lab is mainly designed to implement different techniques of statistics for solving problems related to the biological sciences. The course is also open to the Ph.D. students of this Institute. I have designed the course to cover the following topics. The software R will be used extensively throughout the lab sessions. Particular emphasis will be given to writing functions in R rather than to theory. The platform RStudio will be deployed extensively to explore its extended capabilities. R is really advanced in graphics; however, in this lab you are encouraged to make use of the basic graphics functions and use them to prepare plots. This will increase your expertise in R.
Introduction to R Programming, Installation, datatypes (vector, matrices, dataframe, list), plots, inbuilt functions, user-defined functions, document generation
Mathematical Programming using R, examples of numerical analysis (1 lecture with assignments)
Statistical Distribution in R (1 lecture with assignments)
Linear and Nonlinear Regression
Resampling techniques and model selection
Biological Growth Curve Models: A bit biological theory
Solving Differential Equations with R
Multivariate Statistical Analysis: Principal Component Analysis, Factor Analysis
References:
Applied Statistical Inference: Likelihood and Bayes by Leonhard Held and Daniel Sabanés Bové, Springer-Verlag Berlin 2014
The R Student Companion, Brian Dennis, CRC Press, 2013.
An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie and Tibshirani, Springer Text in Statistics 2013
Using R for Numerical Analysis in Science and Engineering by Victor A. Bloomfield, CRC Press, Taylor and Francis Group 2014
A Primer of Ecology with R by M. Henry H. Stevens, Springer 2009
Statistical Modeling: The Two Cultures by Leo Breiman, Statistical Science 2001, Vol. 16, No. 3, 199-231
Acknowledgement: My project student Steffi D'Souza (M.Sc. 2014-16) contributed significantly to the resampling methods section. I am really thankful for her involvement.
Introduction to R Programming.
R was begun in the 1990s by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland. Nearly 20 senior statisticians form the core development group of the R language, including the primary developer of the original S language, John Chambers of Bell Labs. R provides an effective data handling and storage facility.
It contains a large, coherent, integrated collection of intermediate tools for data analysis. The software is FREE and can be downloaded from http://cran.r-project.org/mirrors.html. Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. Currently, the CRAN package repository features 10964 available packages: https://cran.r-project.org/web/packages/. To install packages, type install.packages(pkgs, dependencies=TRUE) in the console.
You are requested to copy the following codes into a new R script; the outputs are self-explanatory. I will write comments wherever necessary.
x=1:10; print(x)
x=numeric(10); x
x=seq(from=1, to=10, by=2); x; help(seq)
x = seq(from = 1, to = 10, length.out = 100)
x=rep(0,10); x
x=rep(c(1,2),10); x
x=c(3,5,7,3,5,7.5,8,2) #c stands for concatenation
sum(x)
min(x)
max(x)
cumsum(x)
cumprod(x)
which.max(x)
which.min(x)
mean(x)
sd(x)
x>5
sum(x>5) #number of values which are greater than 5
rev(x)
length(x)
sort(x, decreasing = TRUE ); help(sort)
x[x>5] #gives the values which are greater than 5
sum(x[x>5])
all(x>0)
any(x>0)
any(x<0)
y = c(1.3,2.5, 4/7,9, 11, 22/7)
x[3]
y[3:6]
x+y
x-y
x*y
x/y
x=c(3,5,7,3,5,7.5,8,2,NA)
sum(x)
is.na(x) #indicates, for each element, whether it is missing
sum(is.na(x)) #number of missing values
sum(!is.na(x))
x=x[!is.na(x)] #only non missing values are stored into x
square.x=x^2
square.x
sum(x^2)
class(x)
#class(A); class(C) # A and C are not defined yet; run these after creating the matrices below
range(x)
summary(x)
#Matrices in R
A=matrix(c(1,2,3,4), nrow=2, ncol=2);A
B=matrix(c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE);B #Note the difference made by byrow=TRUE
C=matrix(c("a","b","c","d"),nrow=2, ncol=2);C
D=matrix(c(1,"a",2,"b"), nrow=2, ncol=2);D
#Note that the matrix D is created with both numbers and characters. But if you print D you will find that all the entries are displayed as characters. A matrix can hold only one type of data.
A=matrix(c(rep(1,4), rep(2,4), rep(3,4)), nrow=3, ncol=4, byrow=TRUE);A
rownames(A)=c("r1","r2","r3")
colnames(A)=c("c1","c2","c3","c4"); A
rowMeans(A); colMeans(A)
new_A=cbind(A,rowMeans(A)); new_A
colnames(new_A)=c(colnames(A),"Row Means")
new_A=rbind(new_A,c(colMeans(A),sum(A)))
summary(A); help(summary)
#Operations with Matrices
A=matrix(c(1,2,3,4), nrow=2, ncol=2);A
B = matrix(c(4,5,6,8), nrow = 2, ncol = 2); B
A*B # Elementwise multiplication
A%*%B # Matrix multiplication
diag(A) # diagonal entries of the matrix
x = c(1,2,3,5)
diag(x) # square matrix with diagonal entries with x
b = c(1,10)
solve(A, b)# Solve the linear equation of the form Ax = b
solve(A) # Inverse of the matrix A
t(A) # Transpose of a matrix
eigen(A); help(eigen) # Eigenvalues and eigenvectors of A
#DATA frames in R
age = c(26, 42, 45, 31, 56)
income = c(100, 140, 160, 92, 200)
married = c(FALSE,TRUE,FALSE,TRUE,TRUE)
name = c("James","David", "Lisa", "Mark", "Richard")
myData = data.frame(name, age, income, married)
class(myData); dim(myData)
names(myData); str(myData)
myData[,"name"]; myData[,1] # access the columns by name of by index of the column
blood = c("AB+", "A", "B", "O-", "B") # Adding new variables in your dataframe
newData = cbind(myData, blood) # cbind stands for column bind; add new rows by rbind
newData
edit(myData)
fix(myData)
data = newData[,-c(5)] # Deleting a column from a dataframe
data
summary(myData)
# Dataframe with missing values. The missing values are indicated by NA in R.
name=c("Bravo", "Anderson", "Gayle", "Gaurav", "Sunil")
age=c(24,33,15,20,NA)
income=c(1000,1070,1500,2500,NA)
marriage=c(TRUE, TRUE, FALSE, TRUE,NA)
mydata=data.frame(name, age, income, marriage)
mydata
summary(mydata); head(mydata,3)
mydata[,2]; mydata[,"name"]
mydata[,1]; mydata[2,]
is.na(mydata); sum(is.na(mydata))
complete.cases(mydata) #TRUE for rows with no missing values
mydata[complete.cases(mydata),] #give all rows which has complete information
mydata[!complete.cases(mydata),]
newdata=na.omit(mydata) #remove the rows which have missing values
newdata; dim(mydata); tail(mydata,2)
#Replacing the missing value
mydata[which(is.na(mydata$age)),"age"]=mean(mydata$age, na.rm=TRUE) #na.rm=TRUE removes NA values from the age column before computing the mean
mydata[which(is.na(mydata$income)),"income"]=mean(mydata$income, na.rm=TRUE)
mydata[which(is.na(mydata$marriage)),"marriage"]=TRUE
mydata
#table function
gender = factor(c(rep("female",91),rep("male",92)))
gender
table(gender)
gender = factor(gender, levels=c("male","female"))
table(gender)
gender = factor(gender, levels=c("Male","female"))
table(gender)
#List. Run the following code and understand the difference with matrix and dataframe
mylist = list(x, A, mydata)
mylist
mylist[[1]]
newlist=list(x, A, mydata, mylist)
newlist[[4]]
x=c(1,2,3,NA,3,5,8,NA,10,2) #Treating Missing Values
is.na(x); !is.na(x)
sum(is.na(x)) # Number of missing values in the column
sum(!is.na(x)) # Number of non-missing values in the column
#Some topics related to graphs (Exercise 12)
#Pie Charts
states<-c("Punjab","Haryana","Tamil Nadu","Andhra Pradesh", "Uttar Pradesh","Others")
procurement<-c(5486,1248,589,3987,1296,1654)
rice.proc<-data.frame(states,procurement)
rice.proc<-transform(rice.proc,percentage=(rice.proc$procurement/14260)*100)
rice.proc; attach(rice.proc)
pie(procurement,labels=c("Punjab(38.47%)","Haryana(8.75%)","Tamil Nadu(4.13%)", "Andhra Pradesh(27.96%)", "Uttar Pradesh(9.09%)","Others(11.60%)"), main="Procurement of Rice('000 tonnes)\nduring Oct. 1993 to Sept. 1994", col=c("royalblue4","orchid","chocolate4","olivedrab3","orange3", "purple1","violetred4"))
detach(rice.proc)
#Barplot
consumption<-matrix(c(6.93,4.70,0.95,0.56,0.52,5.45,4.84,0.42,0.19,0.15),
nrow=2,ncol=5,byrow=T)
rownames(consumption)<-c("Rural","Urban")
colnames(consumption)<-c("Rice","Wheat","Jowar","Bazra","Maize")
consumption
barplot(consumption,beside=T,main="Consumption of cereals (in kg./month)\nduring 1989-'90",legend.text=rownames(consumption))
Exercise-1. Replace the missing value with the number that appears the maximum number of times (the mode).
Exercise-2. Replace the missing values with the mean.
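A sketch for Exercise-1 using table(), with made-up data (ties are resolved by taking the first maximum):
x = c(1, 2, 2, NA, 3, 2, NA)
freq = table(x) # frequencies of the non-missing values
x[is.na(x)] = as.numeric(names(freq)[which.max(freq)]) # most frequent value
x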
all() # returns TRUE if all values are TRUE
any() # returns TRUE if any values are TRUE
args() # information on the arguments to a function
cat() # prints multiple objects, one after the other
cumprod() # cumulative product
cumsum() # cumulative sum
diff() # form vector of first differences
# N. B. diff(x) has one less element than x
history() # displays previous commands used
is.factor() # returns TRUE if the argument is a factor
is.na() # returns TRUE if the argument is an NA
# NB also is.logical(), is.matrix(), etc.
length() # number of elements in a vector or of a list
ls() # list names of objects in the workspace
mean() # mean of the elements of a vector
median() # median of the elements of a vector
order() # x[order(x)] sorts x (by default, NAs are last)
print() # prints a single R object
range() # minimum and maximum value elements of vector
sort() # sort elements into order, by default omitting NAs
rev() # reverse the order of vector elements
str() # information on an R object
unique() # form the vector of distinct values
which() # locates 'TRUE' indices of logical vectors
which.max() # locates (first) maximum of a numeric vector
which.min() # locates (first) minimum of a numeric vector
with() # do computation using columns of specified data frame
#User Defined Function in R
#writing first function
myfun = function(){
return("Hellow World")
}
myfun()
myfun = function(x){
#The function returns mean of the input
return(mean(x))
}
myfun(1:10)
myfun = function(x){
return(rowSums(x))
}
m = matrix(1:9, nrow = 3, ncol = 3)
myfun(m)
newfun = function(x){
return(sum(x))
}
newfun(myfun(m))
#function to find root using bisection method
bisection.root = function(f , xmin, xmax, tol = 1e-5){
a = xmin; b = xmax;
if(f(a)==0)
return(cat("The root is ", a))
if(f(b) == 0)
return(cat("The root is ", b ))
if(f(a)*f(b) > 0)
stop("f(xmin) and f(xmax) must have opposite signs for bisection to work")
while (b-a>tol) {
c = (a+b)/2
if(f(c) == 0)
return(cat("The root is ", c ))
if(f(a)*f(c)<0)
b = c
else
a = c
}
return(cat("The root is", (a+b)/2))
}
f = function(x){
x^2 - 1
}
bisection.root(f , 0.5, 1.7, tol = 1e-5)
newton.raphson = function(f, df, x0, tol = 1e-6, iteration=100){
if(f(x0)==0)
return(x0)
if(df(x0)==0)
return("Give another value for x0") # zero derivative: the iteration cannot start
i = 1 # iteration counter (missing in the original version)
while( i <= iteration){
x1 = x0 - f(x0)/df(x0)
e = abs(x1-x0) # error, computed before updating x0
x0 = x1
if(e < tol)
return(c(x1, i, e))
i = i + 1
}
return(c(x1, iteration, e))
}
newton.raphson(function(x) x^2-2, function(x) 2*x, 0.5) # f and df must be passed as functions
Compute_mean_sd = function(x){
print("The Mean and S.D. are:")
return(c(mean(x),sd(x)))
}
Compute_mean_sd(1:10)
Exercise-3. Modify the function to ask for user input. Type help(scan).
x=scan()
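For example, a minimal interactive version of Compute_mean_sd (a sketch; scan() reads numbers until you press Enter on an empty line):
Compute_mean_sd_user = function(){
x = scan() # type numbers, then Enter on an empty line to finish
Compute_mean_sd(x)
}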
Example-1: Write a function to check whether the roots of a quadratic equation with coefficients a, b and c are real or imaginary
root.check = function(a,b,c){
d=b^2-4*a*c
if(d>0)
print("roots are real.")
else if(d<0)
print("roots are imaginary")
else
print("roots are equal")
}
root.check(1,6,1)
#Modification of the function asking for user input
user.root.check=function(){
coef=scan()
if(length(coef)<3){
print("Three numbers are required.")
return(NULL)
}
else if(length(coef)>3){
print("Only first three numbers will be used for computation")
}
d=coef[2]^2-4*coef[1]*coef[3]
if(d>0){
print("roots are real.")
r1=(-coef[2]+sqrt(d))/(2*coef[1])
r2=(-coef[2]-sqrt(d))/(2*coef[1])
return(c(r1,r2))
}
else if(d<0){
print("roots are imaginary.")
r1=complex(real = -coef[2]/(2*coef[1]), imaginary = sqrt(-d)/(2*coef[1]))
r2=complex(real = -coef[2]/(2*coef[1]), imaginary = -sqrt(-d)/(2*coef[1]))
return(c(r1, r2))
}
else{
print("roots are equal")
r = -coef[2]/(2*coef[1])
return(r)
}
}
user.root.check()