Basic Course on R software

This page is mainly designed to implement some basic knowledge about the R software. I am really thankful to Prof. Amita Majumder of Economics Research Unit (ERU), ISI, Kolkata and obviously to my supervisor Prof. Sabyasachi Bhattacharya (Head of Agricultural and Ecological Research Unit, ISI, Kolkata) to give me such an opportunity to share this knowledge with the students of MSQE , ISI, Kolkata. 

I am very much grateful to one of my lab mate, Sinchan Ghosh (CSIR-JRF) for helping me in preparing this site. 

Now, coming to the point of R software, it was begun in the 1990s by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland. Nearly 20 senior statisticians provide the core development group of the R language, including the primary developer of the original §language, John Chambers, of Bell Labs. It contains a large, coherent, integrated collection of intermediate tools for data analysis. The software is FREE and can be downloaded from  http://www.r-project.org/  or you can download from . The versatility of this software is that it can be used in every domain and the coding structure in this software is quite easy rather than the others. Moreover, R software is used not only for coding purpose, you can also prepare any manuscript or prepare any sorts of presentation in this software. R markdown will help in this case. The major yardstick of the software is the R packages. Basically, Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. Currently, the CRAN package repository features 10964 available packages.https://cran.r-project.org/web/packages/. To install packages type install.packages(pkgs, dependencies=TRUE) in the console.

In most of my research work i have used this software. Moreover, for any sorts of discrepancy do not hesitate to contact me. The contact details is provided in my home page. Don't worry this page will be update as per your needs. 

References:

Course Materials:

Introduction to R programming


x = 2 ### storing a particular value

##### To run any code use "Ctrl + Enter" togetherly from your key board.

print(x) #### Print that value

x

y = 7

print(y)

y

###### Some Fundamental operations ###


z1 = x+y ### Addition of two numbers

z1

z2 = x-y ### Subtraction of two numbers

z2

z3 = x*y; z3 ### Multiplication of two numbers

z4 = x/y; z4 #### Division of two numbers


##### Storing of more than one numbers and perform the same operation as mentioned above


### Vector creation


x1 = c(3, 5, 2, 6) ##### 4 numbers are stored here and c stands for concatenation

length(x1) ##### Check that how many numbers are present there

y1 = c(10, 5, 3, 9) 

length(y1)


#### Repeat the same process  


a = x1 + y1; a

e = sum(x1, y1); e #### Using R command we can also compute the summation

b = x1 - y1; b

c = x1*y1; c

d = x1/y1; d


#### Sequence of number generation 


#### There are two ways of number generation. 

##### The first method can be applied if the common difference 

##### between the numbers are only 1 and that of 

###### second method is applicable for all possible of common difference.


x1 = 1:10; x1 ### 1st 10 natural numbers will generate here

x2 = seq(from = 1, to = 10, by = 3) ### Note that the common difference between the numbers be 3


#### Length wise number generation 


x3 = seq(1, 50, length.out = 30) #### Some of the appeared numbers should be fraction 


###### Use of some inbuilt functions  


x = -10

abs(x) ## Returns the modulus i.e. the magnitude of the number stored under this variable


y = 1.02365

round(y)

round(y, digits = 2) #### Returns a number where 2 digits will appear after the decimal point


z = log(1); z #### Verify the value of log 1, whether it is zero or not

p = log(exp(1)); p

q = log(2, base = 10); q #### Here the base value is different  

log10(2) 

cat("The above two values are same")


numeric(10) ### Returns the value zero

rep(2, 30) ### Repeat the number 2, 30 times 

rep(NA, 20) ### Repeat the logical variable NA 20 times


m = seq(1, 50, by = 1)

head(m) ##### Printing first six outputs

head(m, n = 7L) ##### Printing the first 7 outputs

tail(m) ##### Printing last six outputs


### Revisiting the basic operations 


x=c(5, 7.3, 8, 9.2, 6.3, 4, 10) 

sum(x)

min(x)

max(x)

cumsum(x)

cumprod(x)

which.max(x)

which.min(x)

mean(x)

sd(x)

x>5

sum(x>5) #number of numbers which are greater then 5

rev(x)

length(x)

sort(x, decreasing = TRUE ); help(sort)

x[x>5] #gives the numbers which are greater then 5

sum(x[x>5])

all(x>0)

any(x>0)

any(x<0)

 

### Finding the position of number 


m = c(1, 1, 5, 6, 9.3, 86, 20, 6) ### First form a vector 

length(m)

which(m == 5) ###### This command is used to find the position of any number

length(which(m == 1)) #### Return the value of total number positions where the number is located


###### Identification of the type of value stored


## There are 4 types of variables available in the R software 

## 1. Integer

## 2. Numeric

## 3. Character 

## 4. Logical


class(3)

class(3L)

class("3.2")

class(NA)


#### Conversion of decimal number into integer


as.integer(5.9) #### Returns the value of only integer part


Matrix formation:

We all know that Matrix is nothing but the array of vectors.  In order to analyse any kind of data sets we have to convert it in a matrix format or sometimes we collect the data time by time. Then , we have to club together to form a matrix. So matrix always play a big role in case of statistical analysis of any data sets. In the following i will give you a brief demonstration about the matrix formation in R software. 

####### Matrix formation using R software 

#### In R software matrix can be formed in 4 ways :

## 1. Binding process --- (i) Row bind; (ii) Column bind

## 2. Making a data frame ----- Use the command data.matrix to convert it into matrix

## ******** 3. Using Matrix command 

## 4. Using diag command

#### Part 1 -- Binding process 

### First create a vector 


student1 = c(75, 65, 85, 95)

student2 = c(50, 60, 65, 75)

student3 = c(50, 60, 65, 75)


### Now, we will execute the first binding process i.e. row binding


Marks.data.1 = rbind(student1, student2, student3); Marks.data.1 ### Output is the matrix having separate ##rownames


#### The following command provides seperate column names so that 

#### it will be very easy to understand the data structure


colnames(Marks.data.1) = c("Mid Sem 1", "Sem 1", "Mid Sem 2", "Sem 2"); Marks.data.1 


##### Now, we will execute the 2nd binding process 


Marks.data.2 = cbind(student1, student2, student3); Marks.data.2


### The following command provides the seperate row names


rownames(Marks.data.2) = c("Mid Sem 1", "Sem 1", "Mid Sem 2", "Sem 2"); Marks.data.2 


####### Now, coming to the second part of matrix formation


data = data.frame(student1, student2, student3); data


summary(data)


head(data, 2) ### printing the head values


complete.cases(data) ### printing the missing values 


data = data.frame(student1, student2, student3); data


### Look at the output, beside giving the marks of the students, it will also provide a serial number

### This is the specialty of the data frame. That means if you have any discrepancy about any data

### You can easily find it.


### Now, we will convert the data frame into matrix


new.data = data.matrix(data); new.data


### Again we can insert the row names here


rownames(new.data) =  c("Mid Sem 1", "Sem 1", "Mid Sem 2", "Sem 2"); new.data 


#### Now, we can use the most important tool of R software in order to construct any matrix


M1 = matrix(c(student1, student2, student3), nrow = 4, byrow = F); M1


M2 = matrix(c(student1, student2, student3), nrow = 4, ncol = 3, byrow = T); M2


### Try to understand the difference between these two outputs

### Difference between the data frames and matrix


x = student1


A = Marks.data.1


list = list(x, A, data); list


list[[1]]

newlist=list(x, A, data, list); newlist

newlist[[4]]

x=c(1,2,3,NA,3,5,8,NA,10,2) #Treating Missing Values


is.na(x) ### Returns the logical variable 


!is.na(x) 


sum(is.na(x)) # Number of missing values in the column

sum(!is.na(x)) # Number of non-missing values in the column


## Finally we consider the following matrix


M = matrix(c(student1, student2, student3), nrow = 3, byrow = T); M


### Some basic operations on matrix

## Transpose of the matrix


M.transpose = t(M); M.transpose


### Extracting rows and columns from any matrix

### R software denotes the rows of any matrix can by M[i,] and that of column by M[,j]


First.student = M[1,]; First.student


Mid.sem.1 = M[,1]; Mid.sem.1


##### method of constructing the diagonal matrix


M.diagonal = diag(c(1, 2, 3, 4, 5)); M.diagonal


##### Addition, Subtraction, Multiplication, Determinant, Inverse computation


A = matrix(data = c(10, 122, 113, 515, 120, 100, 24, 25), nrow = 4, ncol = 4); A


B = matrix(data = seq(2, 9), by = 1, nrow = 4, ncol = 4); B


sum = A + B; sum


sub = A - B; sub


component.multilplication = A*B; component.multilplication


multiplication = A%*%B; multiplication


C = matrix(c(1, 2, 4, 6), nrow = 2, ncol = 2); C


determinant = det(C); determinant


inverse = solve(C); inverse


### Simultaneous system of equation solve

### Consider the following system of equations

### 2x + 3y = 5; 9x + 4y = 15


D = matrix(c(2, 3, 9, 4), nrow = 2, byrow = T ); D


b = matrix(c(5, 15), nrow = 2); b


### check whether solution exists or not

if(det(D) != 0)

  print("there exist a unique solution")

 

D.inverse = solve(D); D.inverse


solution = D.inverse %*% b; solution


##### Rank of the matrix

### Several methods are available in mathematical literature to find the rank of the matrix

#### Here we consider a numerical procedure i.e. the QR decomposition methods 


D = matrix(c(2, 3, 9, 4), nrow = 2, byrow = T ); D


rank = qr(D)$rank


### Otherwise you can also install the package "pracma" to find the rank of any matrix


install.packages("pracma")


library(pracma)


rank = Rank(D); rank


#### Some little bit linear algebra 

#### Recall that matrix D


D = matrix(c(2, 3, 9, 4), nrow = 2, byrow = T); D


eigen.value = eigen(D)$values; eigen.value


eigen.vectors = eigen(D)$vectors; eigen.vectors


#table function


gender = factor(c(rep("female",91),rep("male",92)))


gender


table(gender)


gender = factor(gender, levels=c("male","female"))


table(gender)


gender = factor(gender, levels=c("Male","female"))

table(gender)




Assisgnment based on today's lecture:

1. Consider a 4 by 4 matrix and show that it can be written as a sum of symmetric and skew symmetric matrix

2. Consider another 4 by 4 symmetric matrix; show that the eigen values are all real. 

3. Consider a skew symmetric square matrix of order 4 and show that eigen values are purely imaginary or 0

4. Consider any orthogonal matrix and show that the eigen vectors are of unit modulus.

Graph Plotting:

The graph plot or any kind of graphical analysis with that figure is very much important in the research work irrespective of the subject. So i am going to deliver a lecture on the graph plotting in the R software. The following code will help to demonstrate this. 


### Graphical approach in R software ####


### First, i will show some basic methods 


### Create a vector


x = c(1, 5, 7, 9, 11, 13, 20, 25); x


plot(x)


### In the following three codes note the differences


plot(x, type = "l") ### Returns a line graph


plot(x, type = "p") ### Returns a graph having only point i.e. scatter plot


plot(x, type = "b") ### Returns a graph having both point & line


### Some basic properties of the graph


plot(x, type = "b", xlab = "week", ylab = "price") ### Adding the label name in the graph


plot(x, type = "b", xlab = "week", ylab = "price", main = "Price increament in 8 weeks") ### Mentioning the caption above the graph


plot(x, type = "b", xlab = "week", ylab = "price", sub = "Price increament in 8 weeks") ### Mentioning the below above the graph


plot(x, type = "b", xlab = expression(bold(Week)), 

ylab = expression(bold(Price)), main = "Price increament in 8 weeks") ## Bolding the label name 


plot(x, type = "b", xlab = expression(bold(Week)), 

     ylab = expression(bold(Price)), main = "Price increament in 8 weeks", cex.lab = 1.3) ## Increase the label size 


plot(x, type = "b", xlab = expression(bold(Week)), 

     ylab = expression(bold(Price)), main = "Price increament in 8 weeks", 

     cex.lab = 1.3, col = "blue", lwd = 2) ## Increase the label size 


plot(x, type = "b", xlab = expression(bold(Week)), 

     ylab = expression(bold(Price)), main = "Price increament in 8 weeks", 

     cex.lab = 1.3, col = "blue", lwd = 2, xlim = c(1, 7), ylim = c(0, 15)) ## Increase the label size 




### Adding legend to the graph


legend("topleft", legend = ("Stock market price"), col = "blue", lwd = 2, bty = "n")


#### Exporting the graph #####


### 1. Look at the right hand side where the plot has appeared. Just above that graph "Export" 

###    command is present. 


### 2. Click on that command and select your type i.e. either in pdf version or image version.


### 3. If it is image version then after clicking the image a new window will appeared and just save ###    your image properly.  


#### Adding more another graph in a single a graph


y = seq(1, 10); y


lines(y, col = "red", lwd = 2, type = "b")


plot(x, type = "b", xlab = expression(bold(Week)), 

     ylab = expression(bold(Price)), main = "Price increament in 8 weeks", 

     cex.lab = 1.3, col = "blue", lwd = 2, xlim = c(1, 7), ylim = c(0, 15)) ## Increase the label size 


points(y, col = "red", lwd = 2)


##### Graphical presentation of x vs y ###### 


plot(x, y, type = "b", col = "green", lwd = 1.5) #### Look at the output carefully. Try to understand it


x = seq(1, 20); y = seq(21, 40)


plot(x, y, lwd = 3, col = "orange", type = "b") #### Try to explain the functional form this graph


##### Now, we can draw the functional form of the above graph without using the plot command


m = 1; c = 5


curve(expr = m*x+c, from = 0, to = 50, col = "red", 

      lwd = 3, xlab = expression(bold(x)), ylab = expression(bold(y)), cex.lab = 1.2)


curve(expr = x^2, from = 0, to = 50, col = "blue", lwd = 3, add = T)


legend("bottomright", legend = c("straight line", "parabola"), col = c("red", "blue"), lwd = 3, bty = "n")


text(x = 30, y = 27, "Straight line")


text(x = 9, y = 31, "Parabola")


#### Find the intersecting point


### In order to do this we have to equate both the expression i.e. x^2 = x + 5 implies x^2 - x - 5 = 0 

#### We have to find the root of the above equation only

### There are several numerical methods available in but we use just only the following command


f = function(x) x^2-x-5


root = uniroot(f, interval = c(0, 10))$root; root


### Check step whether our result is true or not


abline(v = root, lwd = 2, col = "green")


############################ MATPLOT #########################################


m = matrix(seq(1, 100), nrow = 10); m


matplot(m, type = "l", lwd = 4, lty = 1:10)


################# Pie Charts #####################


students<-c("A","B","C","D", "E","F")

marks<-c(54,12,58,39,30,55)

marksheet<-data.frame(students,marks); marksheet

marksheet<-transform(marksheet,percentage=(marksheet$marks/60)*100); marksheet

#attach(marksheet)

pie(marks,labels=c("A(90%)","B(20%)","C(96%)", "D(65%)", "E(50%)","F(91.7%)"), main="Marksheets of students", col=c("royalblue4","orchid","chocolate4","olivedrab3","orange3", "purple1","violetred4"))

#detach(marksheet)


#############  Barplot ##################


consumption<-matrix(c(6.93,4.70,0.95,0.56,0.52,5.45,4.84,0.42,0.19,0.15),

                    nrow=2,ncol=5,byrow=T)

rownames(consumption)<-c("Rural","Urban")

colnames(consumption)<-c("Rice","Wheat","Jowar","Bazra","Maize")

consumption

barplot(consumption,beside=T,main="Consumption of cereals (in kg./month)\n

during 1989-'90",legend.text=rownames(consumption))



##### See the difference between the barplot and histogram


hist(consumption)


### Try to understand the difference

### Barplot gives you the frequency distribution only

### but histogram provides you both the frequency diagram and probability distribution


#### Detail analysis of the histogram


x = rnorm(n = 1000, mean = 0.3, sd = 1.2); x


h1 = hist(x, probability = F) ### for the frequency diagram


par(mfrow = c(2,2))


h2 = hist(x, probability = T, col = "orange") ### for the distribution diagram


h2 = hist(x, probability = T, breaks = 25, col = "grey")


print(h2)


library(Rlab) 


hplot(x, breaks = 50) #### no need to input the argument probability = T


####### Zoom the graph ###############


library(zoom)

plot(rnorm(100),rnorm(100))

zoomplot.zoom(fact=5,x=0,y=0)


 


Moreover, if you want to plot any 3d graph or any kind of ocean view like structure then go through the package list as mentioned in the "package" section of my webpage.

Assignment:


Regression method :

Regression is very much important thing in analyzing the data sets. It will provide a understanding of the data sets with the curve which e are going to fit. Moreover, with the help of the regression procedure we can guess the underlying model. There are several procedures are available regarding the regression technique. The following will give you the step by step procedure for handling any kind of data sets. The method which i opt is mentioned in the following an as per the following protocol the code is also arranged.

(i) The linear model fitting

(ii)  The non linear model fitting

(iii) Model fitting with the help of the Maximize Likelihood technique (MLE)

(iv) Real data set handling.

######### Linear model fitting #######

###### First consider a dataset from the R software ########


dataset = cars ## dataset consideration


View(dataset) #### Excel like view


speed = dataset$speed ### extracting the speed data


distance = dataset$dist ### extracting the distance data


plot(speed, distance, xlab = expression(bold(Speed)), 

     ylab = expression(bold(Distance)), col = "red", lwd = 2, cex.lab = 1.5)


#### guess the associationship between the speed and the distance data 

#### Now, we will move towards the fitting procedure


fit.linear = lm(distance~speed) ### Consider the relation distance = a*speed + b; b being the intercept


#### That means distance is the response variable and speed is the predictor.


fit.linear


summary(fit.linear) ### Summary will give you all the details of the estimation


intercept = as.numeric(fit.linear$coefficients[1]) ## if you do not include as.numeric then the word (intercept) will appear together with the value

intercept


slope = as.numeric(fit.linear$coefficients[2]); slope


lines(speed, fit.linear$fitted.values, col = "blue", lwd = 3)


legend("topleft", legend = c("estimated"), col = "blue", lwd = 3, bty = "n")


####### Non linear model fitting #####

##### "Puromycin" is also another data sets available in the R software. So we consider this data sets. 


attach(Puromycin)


plot(Puromycin$conc,Puromycin$rate, xlab = expression(bold(Concentartaion)), 

     ylab = expression(bold(Rate)), cex.lab = 1.5)


data = Puromycin[order(conc),]


fit.linear.new = lm(rate~conc, data = data)


lines(data$conc, fit.linear.new$fitted.values, col = "blue", lwd = 2)


func=function(conc,a,k) a*conc/(k+conc)


fit.nonlinear=nls(rate ~ func(conc,a,k), data=data, start=c(a = 50, k = 0.05))


lines(data$conc,predict(fit.nonlinear), col = "red", lwd = 2)


AIC(fit.linear.new)


AIC(fit.nonlinear)


summary(fit.nonlinear)



########### Model fitting with maximum likelihood technique ###########


x = rnorm(10000000, 5, 2); x ## Extracting the 10000000 samples from the normal distribution

n = length(x); n

hist(x, probability = T, col = "grey", breaks = 50)


#### Be careful about the likelihood function. You have to write the negative of the log-likelihood


likelihood = function(mu, sigma)

{

  n*log(sigma)+1/(2*sigma^2)*sum((x-mu)^2)

}


library(bbmle) ##### The package for using the Likelihood methods is "bbmle"


#library(stats4)


m = mle2(likelihood, start = list(sigma= 1, mu = 5))


summary(m)


mu = as.numeric(coef(m)[1]); mu


sigma = as.numeric(coef(m)[2]); sigma


y = rnorm(n, mu, sigma)


lines(density(x), col = "red", lwd = 3, lty = 2)


lines(density(y), col = "blue", lwd = 3, lty = 3)



################ Real data set handling #################

##### You can get the mock data set from the link (Data) .

##### Otherwise you can mail me.  


setwd("D:/Ph.d/Teaching/MSQE_2019") ### setting the working directory


dataset = read.csv("Train_Tickets.csv") ### importing the datasets from the excel file


library(utils)


View(dataset) #### excel like view


count = dataset[,2] ### extracting all data


l = length(count); l


hist(count, probability = T, breaks = 25, col = "grey", main = "Assignment")


##### Distribution fitting with the help of "fitdistr" command 


library(MASS) 


y = fitdistr(count, "exponential") #### distributional fitting


lambda = as.numeric(y$estimate); lambda 


curve(lambda*exp(-lambda*x), add = T, col = "red", lwd = 3, lty = 1)


##### fitting with MLE ####


##### Exponential distribution fitting ####


likelihood = function(lambda)

{

  -l*log(lambda) + lambda* sum(count)

}


library(bbmle)


#library(stats4)


m = mle2(likelihood, start = list(lambda = 0.007))


summary(m)


lambda.MLE = as.numeric(coef(m)); lambda.MLE


curve(lambda.MLE*exp(-lambda.MLE*x), add = T, col = "blue", lwd = 3, lty = 3)


legend("topright", legend = c("fitdistr command", "MLE"), 

       col = c("red", "blue"), lwd = 3, lty = c(1, 3), bty = "n")

Some time it may happen that the data sets are available in grouped format. In order to perform the regression techniques for those kind of data sets use the package "grouped". The command "group" under this package will be the key one in solving this issue.