Data Analytics with R
This course is designed for the 3rd Semester MBA students of the School of Management, Techno India Group. The lectures focus entirely on data analytics with the R software. The first two to three classes will help all students develop a basic working knowledge of R; the following three classes will cover data handling and the associated statistical analysis.
Acknowledgement:
I would like to thank my supervisor, Prof. Sabyasachi Bhattacharya, and the Dean of the School of Management, Techno India, Prof. Amit Kundu, for giving me the opportunity to take these classes.
Introduction to the R software:
R was begun in the 1990s by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland. Nearly 20 senior statisticians form the core development group of the R language, including John Chambers of Bell Labs, the primary developer of the original S language. R contains a large, coherent, integrated collection of intermediate tools for data analysis. The software is FREE and can be downloaded from http://www.r-project.org/. Its versatility lies in the fact that it can be used in every domain, and its coding structure is easier than that of most alternatives. Moreover, R is not used only for coding: you can also prepare manuscripts and all sorts of presentations in it; R Markdown helps in this case. A major strength of the software is its packages. Packages are collections of R functions, data, and compiled code in a well-defined format, and the directory where packages are stored is called the library. Currently, the CRAN package repository features 10964 available packages (https://cran.r-project.org/web/packages/). To install packages, type install.packages(pkgs, dependencies = TRUE) in the console.
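For instance, to install and load one specific package (ggplot2 here is only an illustrative choice, not a course requirement), you would run:
install.packages("ggplot2", dependencies = TRUE) ## download and install the package from CRAN
library(ggplot2) ## load the installed package into the current session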
I have used this software in most of my research work. For any sort of discrepancy, do not hesitate to contact me; the contact details are provided on my home page. Don't worry, this page will be updated as per your needs.
References:
An Introduction to R, Longhow Lam. (Link: https://cran.r-project.org/doc/contrib/Lam-IntroductionToR_LHL.pdf)
Applied Statistical Inference: Likelihood and Bayes, Leonhard Held and Daniel Sabanes Bove, Springer-Verlag Berlin, 2014.
The R Student Companion, Brian Dennis, CRC Press, 2013.
An Introduction to Statistical Learning with Applications in R, James, Witten, Hastie and Tibshirani, Springer Texts in Statistics, 2013.
Using R for Numerical Analysis in Science and Engineering, Victor A. Bloomfield, CRC Press, Taylor and Francis Group, 2014.
A Primer of Ecology with R, M. Henry H. Stevens, Springer, 2009.
Statistical Modeling: The Two Cultures, Leo Breiman, Statistical Science 2001, Vol. 16, No. 3, 199-231.
The Art of R Programming, Norman Matloff.
An R Companion for the Handbook of Biological Statistics, Salvatore S. Mangiafico: https://rcompanion.org/documents/RCompanionBioStatistics.pdf
Day 1 : 16th January, 2021. (12.30 p.m. - 1.30 p.m.)
1. Introduction to R Programming:
Before diving into the core statistical analysis of data sets, we should become a little accustomed to the R software. The following lecture will help you establish a basic understanding of it.
x = 2 ### storing a particular value
##### To run any code, press "Ctrl + Enter" together on your keyboard.
print(x) #### Print that value
x
y = 7
print(y)
y
###### Some Fundamental operations ###
z1 = x+y ### Addition of two numbers
z1
z2 = x-y ### Subtraction of two numbers
z2
z3 = x*y; z3 ### Multiplication of two numbers
z4 = x/y; z4 #### Division of two numbers
##### Storing more than one number and performing the same operations as above
### Vector creation
x1 = c(3, 5, 2, 6) ##### 4 numbers are stored here; c stands for concatenation
length(x1) ##### Check how many numbers are stored
y1 = c(10, 5, 3, 9)
length(y1)
#### Repeat the same process
a = x1 + y1; a
e = sum(x1, y1); e #### sum() adds up all the elements of both vectors, so e equals sum(a)
b = x1 - y1; b
c = x1*y1; c
d = x1/y1; d
#### Sequence of number generation
#### There are two ways of generating a sequence of numbers.
##### The first method (the colon operator) applies only when the common
##### difference between the numbers is 1, while the
###### second method (seq) works for any common difference.
x1 = 1:10; x1 ### generates the first 10 natural numbers
x2 = seq(from = 1, to = 10, by = 3) ### Note that the common difference between the numbers is 3
#### Length wise number generation
x3 = seq(1, 50, length.out = 30) #### Some of the generated numbers will be fractional
Day 2 : 19th January, 2021. (9.00 a.m. - 11.00 a.m.)
Today we divide the two-hour class into two sessions. The first half continues the basic course, i.e. the introduction portion, and the second half is on the graphics course in the R software. The following material is also divided in the same way.
2. Continuation of the "Introduction" portion
#### The keyboard shortcut for opening a new R script is
### Ctrl+Shift+N
### Use of some in-built functions ###
x = -10
abs(x) ### Returns the absolute value
y = 1.02365
round(y) ### Rounds off to the nearest integer
round(y, digits= 3)
z = log(1)
z
q = log(2, base = 10)
q
p = log(10, base = 10)
p
print(p)
print("The line is required") ### Line will be appeared under double quotes
cat("The line is required") ### Line will be not appeared under double quotes
### Repeating any number
a = rep(2, 10)
length(a)
b = rep(0, 10)
b
c = numeric(15) ### A vector of 15 zeros appears
c
length(c)
rep(NA, 5)
d = seq(1, 10)
d
head(d, 3)
tail(d, 2)
tail(d, 5)
e = seq(1, 10, by = 3)
e
f = seq(from = 1, to = 10, length.out = 5)
f
######
x = c(5, 7.3, 8, 9.2, 6.3, 4, 10)
x
sum(x)
x^2
max(x)
min(x)
mean(x)
var(x)
sd(x)
s = sort(x)
length(x)
median.position = (length(x)+1)/2 ## this formula assumes length(x) is odd (here it is 7)
median.position
median = s[median.position]
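## A quick added check: the built-in median() should agree with the
## manual computation above (which assumes an odd number of observations)
median(x)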
####
x1 = c(-5, 6, -9, 2)
x1>0
x1<0
a1 = which(x1>0)
length(a1)
b1 = which(x1<0)
b1
length(b1)
### Our objective is to print only the positive numbers
a1 = which(x1>0)
a1 ## The positions where the positive numbers are located
x1[a1] ## It returns you the positive numbers.
### Our objective is to print only the negative numbers
b1 = which(x1<0)
b1 ## The positions where the negative numbers are located
x1[b1] ## It returns you the negative numbers.
3. Graphics portion in the R software:
x = c(160, 162, 163, 165, 168, 170)
x
plot(x, type = "p") ### Scatter Plot
plot(x, type = "l") ### Line Plot
plot(x, type = "b") ### Line and Point together in a Plot
plot(x, type = "b", xlab = "Number of Persons", ylab = "Height")
plot(x, type = "b", xlab = "Number of Persons", ylab = "Height",
main = "Height of Six individuals")
plot(x, type = "b", xlab = "Number of Persons", ylab = "Height",
sub = "Height of Six individuals")
plot(x, type = "b", xlab = "Number of Persons", ylab = "Height",
main = "Height of Six individuals")
plot(x, type = "b", xlab = "Number of Persons", ylab = "Height",
main = "Height of Six individuals", col = "red", lwd = 3, lty = 3)
plot(x, xlim = c(0, 4), ylim = c(160, 165), cex.lab = 1)
plot(x, type = "b", xlab = "Number of Persons", ylab = "Height",
main = "Height of Six individuals", col = "red", lwd = 3, lty = 3, cex.lab = 2)
plot(x, type = "l")
legend("topleft", legend = "Height of persons", lwd = 1, bty = "n")
#### Inserting Another Graph
y = seq(160, 165, length.out = 6)
lines(y, col = "red", lwd = 2, lty = 2)
plot(x, y, type = "b", col = "red", lwd = 2)
### Find the differences between the barplot and the histogram
x
barplot(x) ### Command for Barplot
hist(x, probability = F) ## Command for Histogram
#### Normally distributed data generation
n = rnorm(10, mean = 1, sd = 0.5)
n
number.of.sample.size = length(n)
hist(n)
mean(n)
std.error = sd(n)/sqrt(number.of.sample.size)
std.error
sd(n)
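#### A small added sketch (my own illustration): the standard error
#### shrinks as the sample size grows, since it divides sd(n) by sqrt(n)
n.big = rnorm(1000, mean = 1, sd = 0.5)
sd(n.big)/sqrt(length(n.big)) ## noticeably smaller than std.error above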
Day 3 : 28th January 2021 (11.30 a.m. - 2.30 p.m. )
We bifurcate the three-hour class into two sessions. The first session is organized around the construction of matrices in the R software. As we all know, any data set consists of a certain number of rows and columns, i.e. it has the structure of a matrix, so it is very important to be able to handle matrices in R. The following codes demonstrate the construction of a matrix in R.
Session 1: (11.30 a.m. - 1 p.m. )
####### Matrix formation using R software
#### In R software a matrix can be formed in 4 ways:
## 1. Binding process --- (i) Row bind; (ii) Column bind
## 2. Making a data frame ----- use the command data.matrix to convert it into a matrix
## 3. Using the matrix command
## 4. Using the diag command
#### Part 1 -- Binding process
### First create a vector
group1 = c(175, 165, 185, 195)
group2 = c(150, 160, 165, 175)
group3 = c(150, 160, 165, 175)
### Now, we will execute the first binding process i.e. row binding
Height.data.1 = rbind(group1, group2, group3); Height.data.1 ### Output is a matrix with separate row names
#### The following command provides separate column names so that
#### it will be very easy to understand the data structure
colnames(Height.data.1) = c("individual 1", "individual 2", "individual 3", "individual 4"); Height.data.1
##### Now, we will execute the 2nd binding process
Height.data.2 = cbind(group1, group2, group3); Height.data.2
### The following command provides the separate row names
rownames(Height.data.2) = c("individual 1", "individual 2", "individual 3", "individual 4"); Height.data.2
####### Now, coming to the second part of matrix formation
data = data.frame(group1, group2, group3); data
summary(data)
head(data, 2) ### printing the first 2 rows
complete.cases(data) ### returns TRUE for rows with no missing values
data = data.frame(group1, group2, group3); data
### Look at the output: besides giving the heights of the groups, it also provides a serial number
### for each row. This is the specialty of the data frame: if you find any discrepancy in the data,
### you can easily locate the offending row by its serial number, as sketched below.
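## For example (a small added sketch), to inspect row 2 by its serial number:
data[2, ] ## the full row
data$group1[2] ## or a single value from one column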
### Now, we will convert the data frame into matrix
new.data = data.matrix(data); new.data
### Again we can insert the row names here
rownames(new.data) = c("individual 1", "individual 2", "individual 3", "individual 4"); new.data
#### Now we can use the matrix() command, the most important tool in R for constructing any matrix
M1 = matrix(c(group1, group2, group3), nrow = 4, byrow = F); M1
M2 = matrix(c(group1, group2, group3), nrow = 4, ncol = 3, byrow = T); M2
## Finally we consider the following matrix
M = matrix(c(group1, group2, group3), nrow = 3, byrow = T); M
### Some basic operations on matrix
## Transpose of the matrix
M.transpose = t(M); M.transpose
### Extracting rows and columns from any matrix
### In R, row i of a matrix M is written M[i, ] and column j is written M[, j]
First.group = M[1,]; First.group
Mid.sem.1 = M[,1]; Mid.sem.1
##### Method of constructing a diagonal matrix
M.diagonal = diag(c(1, 2, 3, 4, 5)); M.diagonal
M.diagonal.1 = diag(group1); M.diagonal.1
##### Addition, Subtraction, Multiplication, Determinant, Inverse computation (the last two are sketched at the end of this block)
A = matrix(data = c(10, 122, 113, 515, 120, 100, 24, 25), nrow = 4, ncol = 4); A ## the 8 values are recycled to fill the 16 cells (R issues a warning)
B = matrix(data = seq(2, 9, by = 1), nrow = 4, ncol = 4); B ## the "by" belongs inside seq(); its 8 values are recycled here as well
addition = A + B; addition ### matrix addition (renamed to avoid masking the built-in sum())
subtraction = A - B; subtraction ### matrix subtraction
component.multiplication = A*B; component.multiplication ### element-wise multiplication
multiplication = A%*%B; multiplication ### matrix multiplication
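#### Determinant and inverse (an added sketch): A and B above are singular
#### because the recycled values repeat column-wise, so we use a fresh
#### invertible matrix C of our own choosing
C = matrix(c(2, 1, 1, 3), nrow = 2); C ## a small invertible 2 x 2 matrix
det(C) ## determinant: 2*3 - 1*1 = 5
C.inverse = solve(C); C.inverse ## solve() with a single argument returns the inverse
C %*% C.inverse ## should be (numerically) the 2 x 2 identity matrix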
Day 3 : 28th January 2021 (11.30 a.m. - 2.30 p.m. )
The next session is devoted to regression analysis with a real-life data set. We use this "Data" for today's discussion, so download it first and open it in Excel.
Session 2: (1.10 p.m. - 2.30 p.m. )
# 1. First copy the directory path, then place it inside the command setwd(".")
# 2. Note that the backslashes of a Windows path must be replaced by forward slashes, as below.
# 3. Then save the Excel file in .csv (Comma delimited) format.
setwd("C:/Users/USER/Dropbox/Techno_India")
getwd()
data = read.csv("data_set_1.csv")
data
View(data) ### Excel like view
summary(data) ## To infer the fundamental statistical measures of your dataset
m = data$Start.Up.Skills ## Extraction of column from data
length(m)
## There exists another process, which is given below
## An element can be accessed as data[i, j]; i = row number, j = column number
entrepenureship_rate = data[1:200,1]
entrepenureship_rate
Start.Up.Skills = data[1:200,2]
Start.Up.Skills
plot(Start.Up.Skills,entrepenureship_rate)
## Data frame construction
new.data = data.frame(Start.Up.Skills, entrepenureship_rate)
### Linear model fit
## Any linear equation can be written as y = a + b*x
## The linear model fit can be executed by the command "lm"
fit.linear = lm(entrepenureship_rate~Start.Up.Skills)
summary(fit.linear)
### Non-linear model fit
## There exist several non-linear models in the literature,
# but here I use y = c*exp(-x)
## The non-linear model fit can be executed by the command "nls"
fit.non.linear = nls(entrepenureship_rate~c*exp(-Start.Up.Skills), data = new.data
, start = list(c = 1))
summary(fit.non.linear)
## Coefficient extraction
coef(fit.linear)[1]
## See the difference between the above and the following two commands:
## as.numeric() strips off the coefficient's name, leaving a plain number for arithmetic
a = as.numeric(coef(fit.linear)[1]); a
b = as.numeric(coef(fit.linear)[2]); b
## Sorting the explanatory variable
new.skill = sort(Start.Up.Skills)
predicted.linear = a + b*new.skill ## Compute the predicted values
## Overlaying the fitted line on the scatter plot
lines(new.skill, predicted.linear, col = "green", lwd = 2)
c = as.numeric(coef(fit.non.linear)[1]); c
predicted.non.linear = c*exp(-new.skill)
lines(new.skill, predicted.non.linear, col = "red")
## AIC is the statistical measure based on which you can judge whether
## your model is performing better or not.
## The full form of AIC is Akaike Information Criterion.
## The lower the AIC value, the better the model.
AIC(fit.linear)
AIC(fit.non.linear)
####### Correlation #######
x = 1:100
y = 101:200
plot(x, y)
cor(x, y) ## Command to find the correlation
cor.test(x, y) ## To test whether the correlation is really significant or not
cor.test(Start.Up.Skills, entrepenureship_rate)
Day 4 : 8th February 2021 (11.30 a.m. - 2.30 p.m.)
Announcement:
Indian Statistical Institute, Kolkata regularly organizes a workshop on the SPSS software. Please go through the link "https://www.isical.ac.in/node/2666" to join this year. It is a 7-day workshop, which will be conducted in physical mode.
Like the previous class, we again bifurcate this class into two halves. The first session continues the regression discussion, and the second session is on time series analysis. So, for the first session, download the "Vehicle" data.
Session 1: (11.30 a.m. - 1.05 p.m.)
vehicle = read.csv("C:/Users/USER/Dropbox/Tuition/Techno_India/Time_Series_Analysis/vehicle.csv")
head(vehicle)
str(vehicle)
summary(vehicle)
vehicle$lh[vehicle$lh == 0] = NA ## treat zero entries of lh as missing values
vehicle$lc[vehicle$lc == 0] = NA ## likewise for lc
summary(vehicle)
pairs(vehicle[3:5])
cor(vehicle[3:5]) ## with NAs present, cor() returns NA; cor(vehicle[3:5], use = "complete.obs") would use only complete rows
vehicle = read.csv("C:/Users/USER/Dropbox/Tuition/Techno_India/Time_Series_Analysis/vehicle.csv")
head(vehicle)
str(vehicle)
summary(vehicle)
vehicle$lh[vehicle$lh == 0] = mean(vehicle$lh[vehicle$lh != 0]) ## impute with the mean of the valid (non-zero) entries
vehicle$lc[vehicle$lc == 0] = mean(vehicle$lc[vehicle$lc != 0]) ## including the zeros in the mean would bias it downward
summary(vehicle)
pairs(vehicle[3:5])
cor(vehicle[3:5])
### Random Sampling ####
x = seq(1, 10); x
sample(x, size = 3, replace = T) ## draw 3 values with replacement
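## For comparison (a small added sketch): without replacement, each element can appear at most once
sample(x, size = 3, replace = F)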
## Data Partition
#set.seed(1234) ## uncomment to make the random partition reproducible
ind = sample(2, nrow(vehicle), replace = T, prob = c(0.7, 0.3))
training = vehicle[ind == 1,]
testing = vehicle[ind == 2, ]
head(training)
head(testing)
cbind(summary(training$lc), summary(testing$lc))
### Multiple linear regression
model = lm(lc~lh + Mileage, data = training)
model
summary(model)
plot(lc~lh, training)
abline(model, col = "blue")
## Model Diagnostics
par(mfrow = c(2,2))
plot(model)
vehicle[1620,] ## inspect observation 1620 (e.g. one flagged in the diagnostic plots above)
# Prediction
pred = predict(model, testing)
head(pred)
head(testing)
## Outlier detection ##
boxplot(testing$lc)
## What is Multicollinearity ##
# 1. Occurs when there are moderate to high correlations
#    among the independent variables
# 2. Leads to unstable estimates
# 3. A Variance Inflation Factor (VIF) > 10 indicates
#    the presence of multicollinearity
## Variance Inflation Factor (VIF)
str(training)
pairs(training[2:6])
newmodel = lm(lc~lh+Mileage+mc+fm, data= training)
library(faraway)
vif(newmodel)
Day 4 : 8th February 2021 (11.30 a.m. - 2.30 p.m.)
Session 2: (1.30 - 2.30 p.m.)
We will discuss time series analysis in this session. A time series consists of three components: (i) trend, (ii) seasonality, and (iii) random fluctuations. First we will identify these three components in the AirPassengers data, which is built into the R software. Then we perform a time series forecast with the help of this data. The following R code demonstrates these issues.
## Time series data
data("AirPassengers")
AP = AirPassengers
str(AP)
View(AP)
head(AP)
ts(AP, frequency = 12, start = c(1949, 1)) ## command for creating a time series object
# frequency = 12 means monthly data (12 observations per year)
# start indicates the starting year and the starting month
plot(AP)
AP = log(AP) ## log transformation stabilizes the increasing variance of the series
plot(AP)
decomp = decompose(AP)
plot(decomp$figure, type = "b",
xlab = "Month", ylab = "Seasonality Index", col = "blue", las = 2)
plot(decomp)
############ Forecasting #########
library(forecast)
model = auto.arima(AP)
model
acf(model$residuals, main = "Correlogram")
pacf(model$residuals, main = "Partial Correlogram")
hist(model$residuals, col ='red', xlab = 'Error', main = 'Histogram of residuals', freq = F)
lines(density(model$residuals))
##### Forecast
f = forecast(model, 48)
plot(f)
accuracy(f)
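## The point forecasts live in f$mean; since AP was log-transformed above,
## exp() brings them back to the original passenger scale (a small added sketch):
head(exp(f$mean), 12) ## the next 12 months of predicted passenger numbers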
Day 5 : 11th February 2021 (11.30 a.m. - 2.30 p.m.)
Session 1: (11.30 a.m. - 1.10 p.m.)
Like the previous two classes, we are again going to bifurcate this class into two halves. The first session is hands-on training on cluster analysis. The class then continues to the second session, where the topic of discussion is Principal Component Analysis, PCA for short. The first session is demonstrated on a data set named "Utilities"; download this data set before starting the session.
# Cluster Analysis
mydata <- read.csv(file.choose(), header = T)
str(mydata)
head(mydata)
pairs(mydata)
# Scatter plot
plot(Fuel_Cost ~ Sales, data = mydata)
with(mydata, text(Fuel_Cost ~ Sales, labels = Company, pos = 4, cex = 0.6))
# Normalize
z = mydata[,-1] ## drop the first column (the non-numeric company identifier)
means = apply(z,2,mean)
sds = apply(z,2,sd)
nor = scale(z,center=means,scale=sds)
##calculate distance matrix (default is Euclidean distance)
distance = dist(nor)
distance
# Hierarchical agglomerative clustering using default complete linkage
mydata.hclust = hclust(distance)
plot(mydata.hclust)
plot(mydata.hclust,labels=mydata$Company,main='Default from hclust')
plot(mydata.hclust,labels=mydata$Company,hang=-1)
# Hierarchical agglomerative clustering using "average" linkage
mydata.hclust.average<-hclust(distance,method="average")
plot(mydata.hclust.average,labels=mydata$Company, hang=-1)
# Cluster membership
member.c = cutree(mydata.hclust,3)
member.a = cutree(mydata.hclust.average,3)
table(member.c, member.a)
#Characterizing clusters
aggregate(nor,list(member.a),mean)
aggregate(nor,list(member.c),mean)
aggregate(mydata[,-1],list(member.c),mean)
# Scree Plot
wss <- (nrow(nor)-1)*sum(apply(nor,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(nor, centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# K-means clustering
kc<-kmeans(nor,3)
kc
library(cluster)
clusplot(nor, kc$cluster, shade = T, labels = 2, lines = 0, main = "k-Means Cluster Analysis")
Day 5 : 11th February 2021 (11.30 a.m. - 2.30 p.m.)
Session 2: (1.10 - 2.30 p.m.)
In this session we will learn about Principal Component Analysis (PCA). As we all know, PCA is used to reduce the dimension of a data set. Check the following code for illustration.
## Data
data("iris")
str(iris)
summary(iris)
## Partition Data
ind = sample(2, nrow(iris), replace = T, prob = c(0.8, 0.2))
training = iris[ind == 1,]
testing = iris[ind == 2,]
##Scatter plot and Correlations
library(psych)
pairs.panels(training[,-5], gap = 0, bg = c("red", "yellow", "blue")[training$Species], pch = 21)
### Principal Component Analysis
pc = prcomp(training[,-5], center= T, scale. = T)
#pc = pca(training[,-5], center= T, scale. = T)
attributes(pc)
pc$center
mean(training$Sepal.Length)
pc$scale
sd(training$Sepal.Length)
print(pc)
## The rotation matrix contains the loadings (the weight of each original variable in each PC)
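## To inspect them directly (a small added check):
pc$rotation ## one column of loadings per principal component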
summary(pc)
## Orthogonality of PCs
pairs.panels(pc$x, gap = 0,
bg = c("red", "yellow", "blue")[training$Species],
pch = 21)
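## A numeric check of the orthogonality (a small added sketch):
round(cor(pc$x), 3) ## the correlations between PC scores are (numerically) zero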
# Bi-plot
library(devtools)
library(ggplot2)
#install_github("vqv/ggbiplot")
#library(factoextra)
#biplot(pc)
library(ggfortify)
screeplot(pc, type = "l", main = "Scree Plot/Elbow Plot")
autoplot(pc, data = training, colour = 'Species',
loadings = TRUE, loadings.colour = 'blue',
loadings.label = TRUE, loadings.label.size = 3)
## Prediction with Principal Components
trg = predict(pc, training)
trg
trg = data.frame(trg, training[5])
tst = predict(pc, testing)
tst
tst = data.frame(tst, testing[5]) ## testing[5] keeps the column name "Species", matching trg above