Basic Course on R software
This page is mainly designed to implement some basic knowledge about the R software. I am really thankful to Prof. Amita Majumder of Economics Research Unit (ERU), ISI, Kolkata and obviously to my supervisor Prof. Sabyasachi Bhattacharya (Head of Agricultural and Ecological Research Unit, ISI, Kolkata) to give me such an opportunity to share this knowledge with the students of MSQE , ISI, Kolkata.
I am very much grateful to one of my lab mate, Sinchan Ghosh (CSIR-JRF) for helping me in preparing this site.
Now, coming to the point of R software, it was begun in the 1990s by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland. Nearly 20 senior statisticians provide the core development group of the R language, including the primary developer of the original §language, John Chambers, of Bell Labs. It contains a large, coherent, integrated collection of intermediate tools for data analysis. The software is FREE and can be downloaded from http://www.r-project.org/ or you can download from . The versatility of this software is that it can be used in every domain and the coding structure in this software is quite easy rather than the others. Moreover, R software is used not only for coding purpose, you can also prepare any manuscript or prepare any sorts of presentation in this software. R markdown will help in this case. The major yardstick of the software is the R packages. Basically, Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. Currently, the CRAN package repository features 10964 available packages.https://cran.r-project.org/web/packages/. To install packages type install.packages(pkgs, dependencies=TRUE) in the console.
In most of my research work i have used this software. Moreover, for any sorts of discrepancy do not hesitate to contact me. The contact details is provided in my home page. Don't worry this page will be update as per your needs.
References:
An introduction to R, Longhow Lam. (Link for the material https://cran.r-project.org/doc/contrib/Lam-IntroductionToR_LHL.pdf )
Applied Statistical Inference: Likelihood and Bayes by Leohard Held and Daniel Sabanes Bove, Springer-Verlag Berlin 2014
The R Student Companion, Brian Dennis, CRC Press, 2013.
An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie and Tibshirani, Springer Text in Statistics 2013
Using R for Numerical Analysis in Science and Engineering by Victor A. Broomfield, CRC Press. Taylor and Francis Group 2014
A Primer of Ecology with R by M. Henry and H. Stevens, Springer 2009
Statistical Modeling: The Two Cultures by Leo Breiman, Statistical Science 2001, Vol. 16, No. 3, 199-231.
The Art of R Programming; Norman Matloff
AN R COMPANION FOR THE HANDBOOK OF BIOLOGICAL STATISTICS; SALVATORE S. MANGIAFICO : https://rcompanion.org/documents/RCompanionBioStatistics.pdf
Course Materials:
Introduction to R programming
x = 2 ### storing a particular value
##### To run any code use "Ctrl + Enter" togetherly from your key board.
print(x) #### Print that value
x
y = 7
print(y)
y
###### Some Fundamental operations ###
z1 = x+y ### Addition of two numbers
z1
z2 = x-y ### Subtraction of two numbers
z2
z3 = x*y; z3 ### Multiplication of two numbers
z4 = x/y; z4 #### Division of two numbers
##### Storing of more than one numbers and perform the same operation as mentioned above
### Vector creation
x1 = c(3, 5, 2, 6) ##### 4 numbers are stored here and c stands for concatenation
length(x1) ##### Check that how many numbers are present there
y1 = c(10, 5, 3, 9)
length(y1)
#### Repeat the same process
a = x1 + y1; a
e = sum(x1, y1); e #### Using R command we can also compute the summation
b = x1 - y1; b
c = x1*y1; c
d = x1/y1; d
#### Sequence of number generation
#### There are two ways of number generation.
##### The first method can be applied if the common difference
##### between the numbers are only 1 and that of
###### second method is applicable for all possible of common difference.
x1 = 1:10; x1 ### 1st 10 natural numbers will generate here
x2 = seq(from = 1, to = 10, by = 3) ### Note that the common difference between the numbers be 3
#### Length wise number generation
x3 = seq(1, 50, length.out = 30) #### Some of the appeared numbers should be fraction
###### Use of some inbuilt functions
x = -10
abs(x) ## Returns the modulus i.e. the magnitude of the number stored under this variable
y = 1.02365
round(y)
round(y, digits = 2) #### Returns a number where 2 digits will appear after the decimal point
z = log(1); z #### Verify the value of log 1, whether it is zero or not
p = log(exp(1)); p
q = log(2, base = 10); q #### Here the base value is different
log10(2)
cat("The above two values are same")
numeric(10) ### Returns the value zero
rep(2, 30) ### Repeat the number 2, 30 times
rep(NA, 20) ### Repeat the logical variable NA 20 times
m = seq(1, 50, by = 1)
head(m) ##### Printing first six outputs
head(m, n = 7L) ##### Printing the first 7 outputs
tail(m) ##### Printing last six outputs
### Revisiting the basic operations
x=c(5, 7.3, 8, 9.2, 6.3, 4, 10)
sum(x)
min(x)
max(x)
cumsum(x)
cumprod(x)
which.max(x)
which.min(x)
mean(x)
sd(x)
x>5
sum(x>5) #number of numbers which are greater then 5
rev(x)
length(x)
sort(x, decreasing = TRUE ); help(sort)
x[x>5] #gives the numbers which are greater then 5
sum(x[x>5])
all(x>0)
any(x>0)
any(x<0)
### Finding the position of number
m = c(1, 1, 5, 6, 9.3, 86, 20, 6) ### First form a vector
length(m)
which(m == 5) ###### This command is used to find the position of any number
length(which(m == 1)) #### Return the value of total number positions where the number is located
###### Identification of the type of value stored
## There are 4 types of variables available in the R software
## 1. Integer
## 2. Numeric
## 3. Character
## 4. Logical
class(3)
class(3L)
class("3.2")
class(NA)
#### Conversion of decimal number into integer
as.integer(5.9) #### Returns the value of only integer part
Matrix formation:
We all know that Matrix is nothing but the array of vectors. In order to analyse any kind of data sets we have to convert it in a matrix format or sometimes we collect the data time by time. Then , we have to club together to form a matrix. So matrix always play a big role in case of statistical analysis of any data sets. In the following i will give you a brief demonstration about the matrix formation in R software.
####### Matrix formation using R software
#### In R software matrix can be formed in 4 ways :
## 1. Binding process --- (i) Row bind; (ii) Column bind
## 2. Making a data frame ----- Use the command data.matrix to convert it into matrix
## ******** 3. Using Matrix command
## 4. Using diag command
#### Part 1 -- Binding process
### First create a vector
student1 = c(75, 65, 85, 95)
student2 = c(50, 60, 65, 75)
student3 = c(50, 60, 65, 75)
### Now, we will execute the first binding process i.e. row binding
Marks.data.1 = rbind(student1, student2, student3); Marks.data.1 ### Output is the matrix having separate ##rownames
#### The following command provides seperate column names so that
#### it will be very easy to understand the data structure
colnames(Marks.data.1) = c("Mid Sem 1", "Sem 1", "Mid Sem 2", "Sem 2"); Marks.data.1
##### Now, we will execute the 2nd binding process
Marks.data.2 = cbind(student1, student2, student3); Marks.data.2
### The following command provides the seperate row names
rownames(Marks.data.2) = c("Mid Sem 1", "Sem 1", "Mid Sem 2", "Sem 2"); Marks.data.2
####### Now, coming to the second part of matrix formation
data = data.frame(student1, student2, student3); data
summary(data)
head(data, 2) ### printing the head values
complete.cases(data) ### printing the missing values
data = data.frame(student1, student2, student3); data
### Look at the output, beside giving the marks of the students, it will also provide a serial number
### This is the specialty of the data frame. That means if you have any discrepancy about any data
### You can easily find it.
### Now, we will convert the data frame into matrix
new.data = data.matrix(data); new.data
### Again we can insert the row names here
rownames(new.data) = c("Mid Sem 1", "Sem 1", "Mid Sem 2", "Sem 2"); new.data
#### Now, we can use the most important tool of R software in order to construct any matrix
M1 = matrix(c(student1, student2, student3), nrow = 4, byrow = F); M1
M2 = matrix(c(student1, student2, student3), nrow = 4, ncol = 3, byrow = T); M2
### Try to understand the difference between these two outputs
### Difference between the data frames and matrix
x = student1
A = Marks.data.1
list = list(x, A, data); list
list[[1]]
newlist=list(x, A, data, list); newlist
newlist[[4]]
x=c(1,2,3,NA,3,5,8,NA,10,2) #Treating Missing Values
is.na(x) ### Returns the logical variable
!is.na(x)
sum(is.na(x)) # Number of missing values in the column
sum(!is.na(x)) # Number of non-missing values in the column
## Finally we consider the following matrix
M = matrix(c(student1, student2, student3), nrow = 3, byrow = T); M
### Some basic operations on matrix
## Transpose of the matrix
M.transpose = t(M); M.transpose
### Extracting rows and columns from any matrix
### R software denotes the rows of any matrix can by M[i,] and that of column by M[,j]
First.student = M[1,]; First.student
Mid.sem.1 = M[,1]; Mid.sem.1
##### method of constructing the diagonal matrix
M.diagonal = diag(c(1, 2, 3, 4, 5)); M.diagonal
##### Addition, Subtraction, Multiplication, Determinant, Inverse computation
A = matrix(data = c(10, 122, 113, 515, 120, 100, 24, 25), nrow = 4, ncol = 4); A
B = matrix(data = seq(2, 9), by = 1, nrow = 4, ncol = 4); B
sum = A + B; sum
sub = A - B; sub
component.multilplication = A*B; component.multilplication
multiplication = A%*%B; multiplication
C = matrix(c(1, 2, 4, 6), nrow = 2, ncol = 2); C
determinant = det(C); determinant
inverse = solve(C); inverse
### Simultaneous system of equation solve
### Consider the following system of equations
### 2x + 3y = 5; 9x + 4y = 15
D = matrix(c(2, 3, 9, 4), nrow = 2, byrow = T ); D
b = matrix(c(5, 15), nrow = 2); b
### check whether solution exists or not
if(det(D) != 0)
print("there exist a unique solution")
D.inverse = solve(D); D.inverse
solution = D.inverse %*% b; solution
##### Rank of the matrix
### Several methods are available in mathematical literature to find the rank of the matrix
#### Here we consider a numerical procedure i.e. the QR decomposition methods
D = matrix(c(2, 3, 9, 4), nrow = 2, byrow = T ); D
rank = qr(D)$rank
### Otherwise you can also install the package "pracma" to find the rank of any matrix
install.packages("pracma")
library(pracma)
rank = Rank(D); rank
#### Some little bit linear algebra
#### Recall that matrix D
D = matrix(c(2, 3, 9, 4), nrow = 2, byrow = T); D
eigen.value = eigen(D)$values; eigen.value
eigen.vectors = eigen(D)$vectors; eigen.vectors
#table function
gender = factor(c(rep("female",91),rep("male",92)))
gender
table(gender)
gender = factor(gender, levels=c("male","female"))
table(gender)
gender = factor(gender, levels=c("Male","female"))
table(gender)
Assisgnment based on today's lecture:
1. Consider a 4 by 4 matrix and show that it can be written as a sum of symmetric and skew symmetric matrix
2. Consider another 4 by 4 symmetric matrix; show that the eigen values are all real.
3. Consider a skew symmetric square matrix of order 4 and show that eigen values are purely imaginary or 0
4. Consider any orthogonal matrix and show that the eigen vectors are of unit modulus.
Graph Plotting:
The graph plot or any kind of graphical analysis with that figure is very much important in the research work irrespective of the subject. So i am going to deliver a lecture on the graph plotting in the R software. The following code will help to demonstrate this.
### Graphical approach in R software ####
### First, i will show some basic methods
### Create a vector
x = c(1, 5, 7, 9, 11, 13, 20, 25); x
plot(x)
### In the following three codes note the differences
plot(x, type = "l") ### Returns a line graph
plot(x, type = "p") ### Returns a graph having only point i.e. scatter plot
plot(x, type = "b") ### Returns a graph having both point & line
### Some basic properties of the graph
plot(x, type = "b", xlab = "week", ylab = "price") ### Adding the label name in the graph
plot(x, type = "b", xlab = "week", ylab = "price", main = "Price increament in 8 weeks") ### Mentioning the caption above the graph
plot(x, type = "b", xlab = "week", ylab = "price", sub = "Price increament in 8 weeks") ### Mentioning the below above the graph
plot(x, type = "b", xlab = expression(bold(Week)),
ylab = expression(bold(Price)), main = "Price increament in 8 weeks") ## Bolding the label name
plot(x, type = "b", xlab = expression(bold(Week)),
ylab = expression(bold(Price)), main = "Price increament in 8 weeks", cex.lab = 1.3) ## Increase the label size
plot(x, type = "b", xlab = expression(bold(Week)),
ylab = expression(bold(Price)), main = "Price increament in 8 weeks",
cex.lab = 1.3, col = "blue", lwd = 2) ## Increase the label size
plot(x, type = "b", xlab = expression(bold(Week)),
ylab = expression(bold(Price)), main = "Price increament in 8 weeks",
cex.lab = 1.3, col = "blue", lwd = 2, xlim = c(1, 7), ylim = c(0, 15)) ## Increase the label size
### Adding legend to the graph
legend("topleft", legend = ("Stock market price"), col = "blue", lwd = 2, bty = "n")
#### Exporting the graph #####
### 1. Look at the right hand side where the plot has appeared. Just above that graph "Export"
### command is present.
### 2. Click on that command and select your type i.e. either in pdf version or image version.
### 3. If it is image version then after clicking the image a new window will appeared and just save ### your image properly.
#### Adding more another graph in a single a graph
y = seq(1, 10); y
lines(y, col = "red", lwd = 2, type = "b")
plot(x, type = "b", xlab = expression(bold(Week)),
ylab = expression(bold(Price)), main = "Price increament in 8 weeks",
cex.lab = 1.3, col = "blue", lwd = 2, xlim = c(1, 7), ylim = c(0, 15)) ## Increase the label size
points(y, col = "red", lwd = 2)
##### Graphical presentation of x vs y ######
plot(x, y, type = "b", col = "green", lwd = 1.5) #### Look at the output carefully. Try to understand it
x = seq(1, 20); y = seq(21, 40)
plot(x, y, lwd = 3, col = "orange", type = "b") #### Try to explain the functional form this graph
##### Now, we can draw the functional form of the above graph without using the plot command
m = 1; c = 5
curve(expr = m*x+c, from = 0, to = 50, col = "red",
lwd = 3, xlab = expression(bold(x)), ylab = expression(bold(y)), cex.lab = 1.2)
curve(expr = x^2, from = 0, to = 50, col = "blue", lwd = 3, add = T)
legend("bottomright", legend = c("straight line", "parabola"), col = c("red", "blue"), lwd = 3, bty = "n")
text(x = 30, y = 27, "Straight line")
text(x = 9, y = 31, "Parabola")
#### Find the intersecting point
### In order to do this we have to equate both the expression i.e. x^2 = x + 5 implies x^2 - x - 5 = 0
#### We have to find the root of the above equation only
### There are several numerical methods available in but we use just only the following command
f = function(x) x^2-x-5
root = uniroot(f, interval = c(0, 10))$root; root
### Check step whether our result is true or not
abline(v = root, lwd = 2, col = "green")
############################ MATPLOT #########################################
m = matrix(seq(1, 100), nrow = 10); m
matplot(m, type = "l", lwd = 4, lty = 1:10)
################# Pie Charts #####################
students<-c("A","B","C","D", "E","F")
marks<-c(54,12,58,39,30,55)
marksheet<-data.frame(students,marks); marksheet
marksheet<-transform(marksheet,percentage=(marksheet$marks/60)*100); marksheet
#attach(marksheet)
pie(marks,labels=c("A(90%)","B(20%)","C(96%)", "D(65%)", "E(50%)","F(91.7%)"), main="Marksheets of students", col=c("royalblue4","orchid","chocolate4","olivedrab3","orange3", "purple1","violetred4"))
#detach(marksheet)
############# Barplot ##################
consumption<-matrix(c(6.93,4.70,0.95,0.56,0.52,5.45,4.84,0.42,0.19,0.15),
nrow=2,ncol=5,byrow=T)
rownames(consumption)<-c("Rural","Urban")
colnames(consumption)<-c("Rice","Wheat","Jowar","Bazra","Maize")
consumption
barplot(consumption,beside=T,main="Consumption of cereals (in kg./month)\n
during 1989-'90",legend.text=rownames(consumption))
##### See the difference between the barplot and histogram
hist(consumption)
### Try to understand the difference
### Barplot gives you the frequency distribution only
### but histogram provides you both the frequency diagram and probability distribution
#### Detail analysis of the histogram
x = rnorm(n = 1000, mean = 0.3, sd = 1.2); x
h1 = hist(x, probability = F) ### for the frequency diagram
par(mfrow = c(2,2))
h2 = hist(x, probability = T, col = "orange") ### for the distribution diagram
h2 = hist(x, probability = T, breaks = 25, col = "grey")
print(h2)
library(Rlab)
hplot(x, breaks = 50) #### no need to input the argument probability = T
####### Zoom the graph ###############
library(zoom)
plot(rnorm(100),rnorm(100))
zoomplot.zoom(fact=5,x=0,y=0)
Moreover, if you want to plot any 3d graph or any kind of ocean view like structure then go through the package list as mentioned in the "package" section of my webpage.
Assignment:
Plot some known functions like sin x, cos x, log x, e^x, pdf of normal distribution in R software by considering x as a vector and check whether it matches with the actual image and put the necessary legends there.
Construct a matrix where each row has a certain maximum value and plot them togetherly and see what happens.
Regression method :
Regression is very much important thing in analyzing the data sets. It will provide a understanding of the data sets with the curve which e are going to fit. Moreover, with the help of the regression procedure we can guess the underlying model. There are several procedures are available regarding the regression technique. The following will give you the step by step procedure for handling any kind of data sets. The method which i opt is mentioned in the following an as per the following protocol the code is also arranged.
(i) The linear model fitting
(ii) The non linear model fitting
(iii) Model fitting with the help of the Maximize Likelihood technique (MLE)
(iv) Real data set handling.
######### Linear model fitting #######
###### First consider a dataset from the R software ########
dataset = cars ## dataset consideration
View(dataset) #### Excel like view
speed = dataset$speed ### extracting the speed data
distance = dataset$dist ### extracting the distance data
plot(speed, distance, xlab = expression(bold(Speed)),
ylab = expression(bold(Distance)), col = "red", lwd = 2, cex.lab = 1.5)
#### guess the associationship between the speed and the distance data
#### Now, we will move towards the fitting procedure
fit.linear = lm(distance~speed) ### Consider the relation distance = a*speed + b; b being the intercept
#### That means distance is the response variable and speed is the predictor.
fit.linear
summary(fit.linear) ### Summary will give you all the details of the estimation
intercept = as.numeric(fit.linear$coefficients[1]) ## if you do not include as.numeric then the word (intercept) will appear together with the value
intercept
slope = as.numeric(fit.linear$coefficients[2]); slope
lines(speed, fit.linear$fitted.values, col = "blue", lwd = 3)
legend("topleft", legend = c("estimated"), col = "blue", lwd = 3, bty = "n")
####### Non linear model fitting #####
##### "Puromycin" is also another data sets available in the R software. So we consider this data sets.
attach(Puromycin)
plot(Puromycin$conc,Puromycin$rate, xlab = expression(bold(Concentartaion)),
ylab = expression(bold(Rate)), cex.lab = 1.5)
data = Puromycin[order(conc),]
fit.linear.new = lm(rate~conc, data = data)
lines(data$conc, fit.linear.new$fitted.values, col = "blue", lwd = 2)
func=function(conc,a,k) a*conc/(k+conc)
fit.nonlinear=nls(rate ~ func(conc,a,k), data=data, start=c(a = 50, k = 0.05))
lines(data$conc,predict(fit.nonlinear), col = "red", lwd = 2)
AIC(fit.linear.new)
AIC(fit.nonlinear)
summary(fit.nonlinear)
########### Model fitting with maximum likelihood technique ###########
x = rnorm(10000000, 5, 2); x ## Extracting the 10000000 samples from the normal distribution
n = length(x); n
hist(x, probability = T, col = "grey", breaks = 50)
#### Be careful about the likelihood function. You have to write the negative of the log-likelihood
likelihood = function(mu, sigma)
{
n*log(sigma)+1/(2*sigma^2)*sum((x-mu)^2)
}
library(bbmle) ##### The package for using the Likelihood methods is "bbmle"
#library(stats4)
m = mle2(likelihood, start = list(sigma= 1, mu = 5))
summary(m)
mu = as.numeric(coef(m)[1]); mu
sigma = as.numeric(coef(m)[2]); sigma
y = rnorm(n, mu, sigma)
lines(density(x), col = "red", lwd = 3, lty = 2)
lines(density(y), col = "blue", lwd = 3, lty = 3)
################ Real data set handling #################
##### You can get the mock data set from the link (Data) .
##### Otherwise you can mail me.
setwd("D:/Ph.d/Teaching/MSQE_2019") ### setting the working directory
dataset = read.csv("Train_Tickets.csv") ### importing the datasets from the excel file
library(utils)
View(dataset) #### excel like view
count = dataset[,2] ### extracting all data
l = length(count); l
hist(count, probability = T, breaks = 25, col = "grey", main = "Assignment")
##### Distribution fitting with the help of "fitdistr" command
library(MASS)
y = fitdistr(count, "exponential") #### distributional fitting
lambda = as.numeric(y$estimate); lambda
curve(lambda*exp(-lambda*x), add = T, col = "red", lwd = 3, lty = 1)
##### fitting with MLE ####
##### Exponential distribution fitting ####
likelihood = function(lambda)
{
-l*log(lambda) + lambda* sum(count)
}
library(bbmle)
#library(stats4)
m = mle2(likelihood, start = list(lambda = 0.007))
summary(m)
lambda.MLE = as.numeric(coef(m)); lambda.MLE
curve(lambda.MLE*exp(-lambda.MLE*x), add = T, col = "blue", lwd = 3, lty = 3)
legend("topright", legend = c("fitdistr command", "MLE"),
col = c("red", "blue"), lwd = 3, lty = c(1, 3), bty = "n")
Some time it may happen that the data sets are available in grouped format. In order to perform the regression techniques for those kind of data sets use the package "grouped". The command "group" under this package will be the key one in solving this issue.