Data Analytics with R

This course is designed for the 3rd-semester MBA students of the School of Management, Techno India Group. The lectures focus entirely on data analytics in the R software. The first two to three classes will help all students develop a basic understanding of R. The following three classes will then cover data handling and the associated statistical analysis.

Acknowledgement:

I would like to thank my supervisor, Prof. Sabyasachi Bhattacharya, and the Dean of the School of Management, Techno India, Prof. Amit Kundu, for giving me the opportunity to take these classes.

Introduction to the R software:

R was begun in the 1990s by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland. Around 20 senior statisticians form the core development group of the R language, including John Chambers of Bell Labs, the primary developer of the original S language. R contains a large, coherent, integrated collection of tools for data analysis. The software is FREE and can be downloaded from http://www.r-project.org/. Its versatility lies in the fact that it can be used in virtually every domain, and its coding structure is easier than that of most alternatives. Moreover, R is not only for coding: you can also prepare manuscripts and presentations in it, and R Markdown will help in this case. A major strength of the software is its packages. Packages are collections of R functions, data, and compiled code in a well-defined format, and the directory where packages are stored is called the library. Currently, the CRAN package repository features 10964 available packages (https://cran.r-project.org/web/packages/). To install a package, type install.packages(pkgs, dependencies = TRUE) in the console.
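As a quick illustration of the package workflow just described, the sketch below shows the install-once, load-every-session pattern. The install line is commented out because it needs internet access, and ggplot2 is used there purely as an example package; the library() call uses stats, which ships with every R installation, so the snippet runs anywhere.

```r
# Install a package once; dependencies = TRUE also pulls in the
# packages it depends on. (Commented out: it needs internet access.)
# install.packages("ggplot2", dependencies = TRUE)

# Load a package at the start of every session before using it.
# stats ships with base R, so this call always succeeds.
library(stats)

search()      # lists the packages currently attached
.libPaths()   # shows where your package library lives on disk
```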

I have used this software in most of my research work. For any sort of discrepancy, do not hesitate to contact me; the contact details are provided on my home page. Don't worry, this page will be updated as per your needs.


Day 1 : 16th January, 2021.  (12.30 p.m. - 1.30 p.m.)

1. Introduction to R Programming:

Before diving into the core statistical analysis of data sets, we should become a little accustomed to the R software. The following lecture will help you build a basic understanding of R.

x = 2 ### storing a particular value

##### To run any line of code, press "Ctrl + Enter" together on your keyboard.

print(x) #### Print that value

x

y = 7

print(y)

y

###### Some Fundamental operations ###


z1 = x+y ### Addition of two numbers

z1

z2 = x-y ### Subtraction of two numbers

z2

z3 = x*y; z3 ### Multiplication of two numbers

z4 = x/y; z4 #### Division of two numbers


##### Storing more than one number and performing the same operations as above


### Vector creation


x1 = c(3, 5, 2, 6) ##### 4 numbers are stored here; c() stands for combine (concatenate)

length(x1) ##### Check how many numbers are stored in x1

y1 = c(10, 5, 3, 9)

length(y1)


#### Repeat the same process


a = x1 + y1; a

e = sum(x1, y1); e #### sum() returns a single number: the total of all elements of both vectors

b = x1 - y1; b

c = x1*y1; c ### element-wise product (note: this masks the built-in c() function)

d = x1/y1; d


#### Sequence of number generation


#### There are two ways of number generation.

##### The first method can be applied only when the common difference

##### between consecutive numbers is 1, whereas the

##### second method works for any common difference.


x1 = 1:10; x1 ### The first 10 natural numbers are generated here

x2 = seq(from = 1, to = 10, by = 3); x2 ### Note that the common difference between the numbers is 3


#### Length wise number generation


x3 = seq(1, 50, length.out = 30); x3 #### 30 equally spaced numbers; some of them will be fractions

Day 2 : 19th January, 2021. (9.00 a.m. - 11.00 a.m.)

Today we divide the two-hour class into two sessions. The first half continues the basic course, i.e. the introduction portion, and the second half is on the graphics course in R. The following materials are divided in the same way.

2. Continuation on the "Introduction Portion"  

#### The keyboard shortcut for opening a new R script is

### Ctrl+Shift+N


### Use of some built-in functions ###


x = -10

abs(x) ### Returns the absolute value


y = 1.02365 

round(y) ### Rounds to the nearest whole number

round(y, digits = 3) ### rounds to 3 decimal places


z = log(1)

z

q = log(2, base = 10)

q

p = log(10, base = 10)

p

print(p)


print("The line is required") ### The line appears enclosed in double quotes

cat("The line is required") ### The line appears without double quotes


### Repeating any number 

a = rep(2, 10)

length(a)

b = rep(0, 10)

b

c = numeric(15) ### numeric(n) creates a vector of n zeros, here 15

c

length(c)

rep(NA, 5)


d = seq(1, 10)

d

head(d, 3)

tail(d, 2)

tail(d, 5)


e = seq(1, 10, by = 3)

e

f = seq(from = 1, to = 10, length.out = 5)

f


###### 


x = c(5, 7.3, 8, 9.2, 6.3, 4, 10)

x

sum(x)

x^2

max(x)

min(x)

mean(x)

var(x)

sd(x)

s = sort(x)

length(x)

median.position = (length(x)+1)/2

median.position

median = s[median.position]
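One caveat about the manual median above: the position formula (length(x)+1)/2 lands on a whole position only when the vector has an odd number of elements. R's built-in median() handles both the odd and the even case, as this small check shows:

```r
x = c(5, 7.3, 8, 9.2, 6.3, 4, 10)   # the same 7 values as above (odd length)
s = sort(x)
manual.median = s[(length(x) + 1) / 2]
manual.median                        # matches the built-in result below
median(x)

# For an even-length vector, median() averages the two middle values.
median(c(1, 2, 3, 4))                # (2 + 3) / 2 = 2.5
```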


####


x1 = c(-5, 6, -9, 2)

x1>0

x1<0

a1 = which(x1>0)

length(a1)

b1 = which(x1<0)

b1

length(b1)

### Our objective is to print only the positive numbers


a1 = which(x1>0)

a1 ## The positions where the positive numbers are located

x1[a1] ## It returns you the positive numbers. 


### Our objective is to print only the negative numbers


b1 = which(x1<0)

b1 ## The positions where the negative numbers are located

x1[b1] ## It returns you the negative numbers. 



3. Graphics portion of the R software:

x = c(160, 162, 163, 165, 168, 170)

x

plot(x, type = "p") ### Scatter Plot 

plot(x, type = "l") ### Line Plot

plot(x, type = "b") ### Line and Point together in a Plot


plot(x, type = "b", xlab = "Number of Persons", ylab = "Height")

plot(x, type = "b", xlab = "Number of Persons", ylab = "Height",

     main = "Height of Six individuals")


plot(x, type = "b", xlab = "Number of Persons", ylab = "Height",

     sub = "Height of Six individuals")


plot(x, type = "b", xlab = "Number of Persons", ylab = "Height",

     main = "Height of Six individuals")

plot(x, type = "b", xlab = "Number of Persons", ylab = "Height",

     main = "Height of Six individuals", col = "red", lwd = 3, lty = 3)

plot(x, xlim = c(0, 4), ylim = c(160, 165), cex.lab = 1)

plot(x, type = "b", xlab = "Number of Persons", ylab = "Height",

     main = "Height of Six individuals", col = "red", lwd = 3, lty = 3, cex.lab = 2)

plot(x, type = "l")

legend("topleft", legend = "Height of persons", lwd = 1, bty = "n")


#### Inserting Another Graph 


y = seq(160, 165, length.out = 6)

lines(y, col = "red", lwd = 2, lty = 2)


plot(x, y, type = "b", col = "red", lwd = 2)


### Note the differences between the barplot and the histogram

x

barplot(x) ### Command for Barplot

hist(x, probability = F) ## Command for Histogram (probability = F plots frequency counts on the y-axis)


#### Normally distributed data generation


n = rnorm(10, mean = 1, sd = 0.5)

n

number.of.sample.size = length(n)

hist(n)

mean(n)

std.error = sd(n)/sqrt(number.of.sample.size)

std.error

sd(n)

Day 3 : 28th January 2021 (11.30 a.m. - 2.30 p.m. )

We bifurcate the 3-hour class into two sessions. The first session is organized around the construction of matrices in R. As we all know, any data set consists of a certain number of rows and columns, i.e. it has the structure of a matrix. So it is very important to be able to deal with matrices in R. The following code demonstrates the construction of matrices in R.

Session 1: (11.30 a.m. - 1 p.m. )

####### Matrix formation using R software

#### In R, a matrix can be formed in 4 ways:

## 1. Binding process --- (i) Row bind; (ii) Column bind

## 2. Making a data frame ----- Use the command data.matrix to convert it into matrix

## ******** 3. Using Matrix command

## 4. Using diag command

#### Part 1 -- Binding process

### First create a vector


group1 = c(175, 165, 185, 195)

group2 = c(150, 160, 165, 175)

group3 = c(150, 160, 165, 175)


### Now, we will execute the first binding process i.e. row binding


Height.data.1 = rbind(group1, group2, group3); Height.data.1 ### Output is a matrix with separate row names


#### The following command assigns separate column names so that

#### the data structure is easy to understand


colnames(Height.data.1) = c("individual 1", "individual 2", "individual 3", "individual 4"); Height.data.1


##### Now, we will execute the 2nd binding process


Height.data.2 = cbind(group1, group2, group3); Height.data.2


### The following command assigns the separate row names


rownames(Height.data.2) = c("individual 1", "individual 2", "individual 3", "individual 4"); Height.data.2


####### Now, coming to the second part of matrix formation


data = data.frame(group1, group2, group3); data


summary(data)


head(data, 2) ### printing the first 2 rows


complete.cases(data) ### returns TRUE for each row with no missing values


data = data.frame(group1, group2, group3); data


### Look at the output: besides the heights of the groups, it also provides a serial number for each row.

### This is the specialty of the data frame: if you have any discrepancy in the data,

### you can easily locate it.


### Now, we will convert the data frame into matrix


new.data = data.matrix(data); new.data


### Again we can insert the row names here


rownames(new.data) = c("individual 1", "individual 2", "individual 3", "individual 4"); new.data


#### Now, we can use the most important tool of R software in order to construct any matrix


M1 = matrix(c(group1, group2, group3), nrow = 4, byrow = F); M1


M2 = matrix(c(group1, group2, group3), nrow = 4, ncol = 3, byrow = T); M2 ### byrow = T fills row by row, so the group values get interleaved



## Finally we consider the following matrix


M = matrix(c(group1, group2, group3), nrow = 3, byrow = T); M


### Some basic operations on matrix

## Transpose of the matrix


M.transpose = t(M); M.transpose


### Extracting rows and columns from any matrix

### R denotes the i-th row of a matrix M by M[i,] and the j-th column by M[,j]


First.group = M[1,]; First.group


Mid.sem.1 = M[,1]; Mid.sem.1


##### Method of constructing a diagonal matrix


M.diagonal = diag(c(1, 2, 3, 4, 5)); M.diagonal

M.diagonal.1 = diag(group1); M.diagonal.1



##### Addition, Subtraction, Multiplication, Determinant, Inverse computation


A = matrix(data = c(10, 122, 113, 515, 120, 100, 24, 25), nrow = 4, ncol = 4); A ### only 8 values are supplied, so R recycles them to fill the 4 x 4 matrix


B = matrix(data = seq(2, 9), nrow = 4, ncol = 4); B ### seq(2, 9) gives 8 values, which are again recycled


matrix.sum = A + B; matrix.sum ### we avoid the name "sum", which would mask the built-in sum()


matrix.diff = A - B; matrix.diff ### likewise, "sub" is a built-in function, so we use a different name


component.multiplication = A*B; component.multiplication ### element-wise multiplication


multiplication = A%*%B; multiplication ### true matrix multiplication
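The heading above also lists the determinant and the inverse, which the code so far does not compute. Here is a minimal sketch on M3, a small invertible matrix introduced just for this example (A and B above become singular because their 8 values are recycled into repeated columns):

```r
# A small invertible 2 x 2 example matrix
M3 = matrix(c(2, 1, 1, 3), nrow = 2); M3

det(M3)                      # determinant: 2*3 - 1*1 = 5

M3.inverse = solve(M3)       # solve() with one argument returns the inverse
M3.inverse

M3 %*% M3.inverse            # recovers the 2 x 2 identity matrix (up to rounding)
```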

Day 3 : 28th January 2021 (11.30 a.m. - 2.30 p.m. )

The next session is devoted to regression analysis with a real-life data set. We use this "Data" for today's discussion, so download the data first and open it in Excel format.

Session 2: (1.10 p.m. - 2.30 p.m. )

# 1. First copy the directory path, then place it inside the command setwd(".")

# 2. Note that any backslashes in the path must be replaced by forward slashes.

# 3. Then save the Excel file in .csv (Comma delimited) format.

setwd("C:/Users/USER/Dropbox/Techno_India")

getwd()

data = read.csv("data_set_1.csv") 

data

View(data) ### Excel like view

summary(data) ## To infer the fundamental statistical measures of your dataset


m = data$Start.Up.Skills ## Extraction of column from data

length(m)


## There exists another process, which is given below

## A matrix element can be written as M[i,j]; i = row number, j = column number


entrepenureship_rate = data[1:200,1] 

entrepenureship_rate

Start.Up.Skills = data[1:200,2]

Start.Up.Skills

plot(Start.Up.Skills,entrepenureship_rate)

## Data frame construction 

new.data = data.frame(Start.Up.Skills, entrepenureship_rate)


### Linear model fit 

## Any linear equation can be written as y = a + b*x (a = intercept, b = slope)

## The linear model fit can be executed by the command "lm"

fit.linear = lm(entrepenureship_rate~Start.Up.Skills)

summary(fit.linear)


### Non-linear model fit 

## There exist several non-linear models in the literature

# But here we use y = c*exp(-x)

## The non-linear model fitting can be executed by the command "nls"


fit.non.linear = nls(entrepenureship_rate~c*exp(-Start.Up.Skills), data = new.data

  , start = list(c = 1))

summary(fit.non.linear)

## Coefficient extraction

coef(fit.linear)[1] 

## See the difference between the command above and the following two commands

a = as.numeric(coef(fit.linear)[1]); a

b = as.numeric(coef(fit.linear)[2]); b


## Sorting the explanatory variable

new.skill = sort(Start.Up.Skills)

predicted.linear = a + b*new.skill ## Compute the predicted values


## Overlaying the fitted curve on the scatter diagram

lines(new.skill, predicted.linear, col = "green", lwd = 2)


c = as.numeric(coef(fit.non.linear)[1]); c

predicted.non.linear = c*exp(-new.skill)

lines(new.skill, predicted.non.linear, col = "red")


## AIC is a statistical measure based on which you can judge

## whether your model performs better or not.

## AIC stands for Akaike Information Criterion.

## The lower the AIC, the better the model.

AIC(fit.linear)

AIC(fit.non.linear)


####### Correlation #######


x = 1:100

y = 101:200

plot(x, y)

cor(x, y) ## Command to find the correlation

cor.test(x, y) ## To test whether the correlation is statistically significant or not

cor.test(Start.Up.Skills, entrepenureship_rate)

Day 4 : 8th February 2021 (11.30 a.m. - 2.30 p.m.)

Announcement:

The Indian Statistical Institute, Kolkata, regularly organizes a workshop on the SPSS software. Please go through the link "https://www.isical.ac.in/node/2666" to join this year. It is a 7-day workshop, which will be organized in physical mode.

Like the previous class, we again bifurcate this class into two halves. The first session continues with regression, and the second session covers time series analysis. So, for the first session, download the "Vehicle" data.

Session 1: (11.30 a.m. -  1.05 p.m.)

vehicle = read.csv("C:/Users/USER/Dropbox/Tuition/Techno_India/Time_Series_Analysis/vehicle.csv")

head(vehicle)

str(vehicle)

summary(vehicle)


vehicle$lh[vehicle$lh == 0] = NA ## treat zero-coded entries as missing values

vehicle$lc[vehicle$lc == 0] = NA

summary(vehicle)

pairs(vehicle[3:5])

cor(vehicle[3:5]) ## note: entries involving a variable with NAs come out as NA (use = "complete.obs" would ignore them)



vehicle = read.csv("C:/Users/USER/Dropbox/Tuition/Techno_India/Time_Series_Analysis/vehicle.csv") ## re-read a fresh copy for the mean-imputation approach

head(vehicle)

str(vehicle)

summary(vehicle)


vehicle$lh[vehicle$lh == 0] = mean(vehicle$lh[vehicle$lh != 0]) ## impute zeros with the mean of the non-zero values

vehicle$lc[vehicle$lc == 0] = mean(vehicle$lc[vehicle$lc != 0])

summary(vehicle)

pairs(vehicle[3:5])

cor(vehicle[3:5])


### Random Sampling ####


x = seq(1, 10); x

sample(x, size = 3, replace = T) ## draw 3 values, sampling with replacement


## Data Partition

set.seed(1234) ## fixing the seed makes the random partition below reproducible


ind = sample(2, nrow(vehicle), replace = T, prob = c(0.7, 0.3))

training = vehicle[ind == 1,]

testing = vehicle[ind == 2, ]

head(training)

head(testing)


cbind(summary(training$lc), summary(testing$lc))


### Multiple linear regression 

model = lm(lc~lh + Mileage, data = training)

model

summary(model)


plot(lc~lh, training)

abline(model, col = "blue") ## note: abline uses only the intercept and the first slope (lh); Mileage is ignored


## Model Diagnostics


par(mfrow = c(2,2))

plot(model)

vehicle[1620,] ## inspect an observation flagged in the diagnostic plots


# Prediction

pred = predict(model, testing)

head(pred)

head(testing)


## Outlier detection ##

boxplot(testing$lc)


## What is Multicollinearity ##

# 1. Occurs when there are moderate to high correlations among the independent variables.


# 2. Leads to unstable estimates.


# 3. A Variance Inflation Factor (VIF) > 10 indicates the presence of multicollinearity.



## Variance Inflation Factor (VIF)


str(training)

pairs(training[2:6])

newmodel = lm(lc~lh+Mileage+mc+fm, data= training)

library(faraway) ## provides vif(); install once with install.packages("faraway") if needed

vif(newmodel)

Day 4 : 8th February 2021 (11.30 a.m. - 2.30 p.m.)

Session 2: (1.30 -  2.30 p.m.)

We will discuss time series analysis in this session. A time series consists of three components, i.e. (i) trend, (ii) seasonality, (iii) random fluctuations. First we will identify these three components in the AirPassengers data, which is built into R. Then we perform a time series forecast with the help of this data. The following R code demonstrates these steps.

## Time series data

data("AirPassengers")

AP = AirPassengers

str(AP)

View(AP)

head(AP)

AP = ts(AP, frequency = 12, start = c(1949, 1)) ## command for creating a time series object

# frequency = 12 means 12 observations per year, i.e. monthly data

# start gives the starting year and the starting month

plot(AP)

AP = log(AP) ## the log transform stabilizes the growing variance

plot(AP)

decomp = decompose(AP)

plot(decomp$figure, type = "b", 

     xlab = "Month", ylab = "Seasonality Index", col = "blue", las = 2) 

plot(decomp)


############ Forecasting #########

library(forecast)

model = auto.arima(AP)

model

acf(model$residuals, main = "Correlogram")

pacf(model$residuals, main = "Partial Correlogram")


hist(model$residuals, col ='red', xlab = 'Error', main = 'Histogram of residuals', freq = F)

lines(density(model$residuals))


##### Forecast


f = forecast(model, 48)

plot(f)

accuracy(f)

Day 5 : 11th February 2021 (11.30 a.m. - 2.30 p.m.)

Session 1: (11.30 a.m. -  1.10 p.m.)

Like the previous two classes, we are again going to bifurcate this class into halves. The first session is hands-on training on cluster analysis. The class then continues to the second session, where the topic of discussion is Principal Component Analysis, PCA for short. The first session is demonstrated on a data set named "Utilities"; download this data set before starting the session.

# Cluster Analysis

mydata <- read.csv(file.choose(), header = T)

str(mydata)

head(mydata)

pairs(mydata)


# Scatter plot 

plot(Fuel_Cost ~ Sales, data = mydata)

with(mydata, text(Fuel_Cost ~ Sales, labels = Company, pos = 4, cex = 0.6))


# Normalize 

z = mydata[,-1] ## drop the first column (the Company names), keeping only the numeric variables

means = apply(z,2,mean)

sds = apply(z,2,sd)

nor = scale(z,center=means,scale=sds)


## Calculate the distance matrix (the default is the Euclidean distance)

distance = dist(nor)

distance

# Hierarchical agglomerative clustering using default complete linkage 

mydata.hclust = hclust(distance)

plot(mydata.hclust)

plot(mydata.hclust,labels=mydata$Company,main='Default from hclust')

plot(mydata.hclust,labels=mydata$Company,hang=-1)


# Hierarchical agglomerative clustering using "average" linkage 

mydata.hclust.average<-hclust(distance,method="average")

plot(mydata.hclust.average,labels=mydata$Company, hang=-1)


# Cluster membership

member.c = cutree(mydata.hclust,3)

member.a = cutree(mydata.hclust.average,3)

table(member.c, member.a)


#Characterizing clusters 

aggregate(nor,list(member.a),mean)

aggregate(nor,list(member.c),mean)


aggregate(mydata[,-1], list(member.c), mean) ## cluster means on the original (unscaled) variables


# Scree Plot

wss <- (nrow(nor)-1)*sum(apply(nor,2,var))

for (i in 2:20) wss[i] <- sum(kmeans(nor, centers=i)$withinss)

plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares") 


# K-means clustering

kc<-kmeans(nor,3)

kc

library(cluster)

clusplot(nor, kc$cluster, shade = T, labels = 2, lines = 0, main = "k-Means Cluster Analysis")

Day 5 : 11th February 2021 (11.30 a.m. - 2.30 p.m.)

Session 2: (1.10  -  2.30 p.m.)

In this session we will learn about Principal Component Analysis (PCA). As we all know, PCA is used to reduce the dimension of a data set. Check the following code for illustration.

## Data 

data("iris")

str(iris)

summary(iris)


## Partition Data

ind = sample(2, nrow(iris), replace = T, prob = c(0.8, 0.2))

training = iris[ind == 1,]

testing = iris[ind == 2,]


##Scatter plot and Correlations 

library(psych)

pairs.panels(training[,-5], gap = 0, bg = c("red", "yellow", "blue")[training$Species], pch = 21)


### Principal Component Analysis

pc = prcomp(training[,-5], center= T, scale. = T)

#pc = pca(training[,-5], center= T, scale. = T)


attributes(pc)

pc$center

mean(training$Sepal.Length)

pc$scale

sd(training$Sepal.Length)

print(pc)


## The rotation matrix contains the loadings (loading scores)


summary(pc)
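The "Proportion of Variance" row printed by summary(pc) can also be computed by hand from the component standard deviations. The sketch below redoes the PCA on the full iris data (pc.full is a new name; the full data is used rather than the random training split so the numbers are reproducible):

```r
# PCA on the full iris data, standardized as above
pc.full = prcomp(iris[, -5], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC, from the standard deviations
var.explained = pc.full$sdev^2 / sum(pc.full$sdev^2)
round(var.explained, 3)

cumsum(var.explained)   # cumulative proportion, as printed by summary(pc.full)

# A common rule of thumb is to keep enough PCs to cover ~90% of the variance.
```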


## Orthogonality of PCs

pairs.panels(pc$x, gap = 0, 

             bg = c("red", "yellow", "blue")[training$Species],

             pch = 21)


# Bi-plot

library(devtools)

library(ggplot2)

#install_github("vqv/ggbiplot")

#library(factoextra)

#biplot(pc)

library(ggfortify)

screeplot(pc, type = "l", main = "Scree Plot/Elbow Plot")

autoplot(pc, data = training, colour = 'Species',

         loadings = TRUE, loadings.colour = 'blue',

         loadings.label = TRUE, loadings.label.size = 3)


## Prediction with Principal Components

trg = predict(pc, training)

trg

trg = data.frame(trg, training[5])


tst = predict(pc, testing)

tst

tst = data.frame(tst, testing[5]) ## testing[5] keeps the column name "Species", matching trg