Basic Course on R software

This page is mainly designed to implement some basic knowledge about the R software. I am really thankful to Prof. Amita Majumder of Economics Research Unit (ERU), ISI, Kolkata and obviously to my supervisor Prof. Sabyasachi Bhattacharya (Head of Agricultural and Ecological Research Unit, ISI, Kolkata) to give me such an opportunity to share this knowledge with the students of MSQE , ISI, Kolkata.

I am very much grateful to one of my lab mate, Sinchan Ghosh (CSIR-JRF) for helping me in preparing this site.

Now, coming to the point of R software, it was begun in the 1990s by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland. Nearly 20 senior statisticians provide the core development group of the R language, including the primary developer of the original §language, John Chambers, of Bell Labs. It contains a large, coherent, integrated collection of intermediate tools for data analysis. The software is FREE and can be downloaded from http://www.r-project.org/ or you can download from . The versatility of this software is that it can be used in every domain and the coding structure in this software is quite easy rather than the others. Moreover, R software is used not only for coding purpose, you can also prepare any manuscript or prepare any sorts of presentation in this software. R markdown will help in this case. The major yardstick of the software is the R packages. Basically, Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. Currently, the CRAN package repository features 10964 available packages.https://cran.r-project.org/web/packages/. To install packages type install.packages(pkgs, dependencies=TRUE) in the console.

In most of my research work i have used this software. Moreover, for any sorts of discrepancy do not hesitate to contact me. The contact details is provided in my home page. Don't worry this page will be update as per your needs.

References:

An introduction to R, Longhow Lam. (Link for the material https://cran.r-project.org/doc/contrib/Lam-IntroductionToR_LHL.pdf )
Applied Statistical Inference: Likelihood and Bayes by Leohard Held and Daniel Sabanes Bove, Springer-Verlag Berlin 2014
The R Student Companion, Brian Dennis, CRC Press, 2013.
An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie and Tibshirani, Springer Text in Statistics 2013
Using R for Numerical Analysis in Science and Engineering by Victor A. Broomfield, CRC Press. Taylor and Francis Group 2014
A Primer of Ecology with R by M. Henry and H. Stevens, Springer 2009
Statistical Modeling: The Two Cultures by Leo Breiman, Statistical Science 2001, Vol. 16, No. 3, 199-231.
The Art of R Programming; Norman Matloff
AN R COMPANION FOR THE HANDBOOK OF BIOLOGICAL STATISTICS; SALVATORE S. MANGIAFICO : https://rcompanion.org/documents/RCompanionBioStatistics.pdf

Course Materials:

Lecture 1 ( 9th August, 2019)

Introduction to R programming

x = 2 ### storing a particular value

##### To run any code use "Ctrl + Enter" togetherly from your key board.

print(x) #### Print that value

y = 7

print(y)

###### Some Fundamental operations ###

z1 = x+y ### Addition of two numbers

z2 = x-y ### Subtraction of two numbers

z3 = x*y; z3 ### Multiplication of two numbers

z4 = x/y; z4 #### Division of two numbers

##### Storing of more than one numbers and perform the same operation as mentioned above

### Vector creation

x1 = c(3, 5, 2, 6) ##### 4 numbers are stored here and c stands for concatenation

length(x1) ##### Check that how many numbers are present there

y1 = c(10, 5, 3, 9)

length(y1)

#### Repeat the same process

a = x1 + y1; a

e = sum(x1, y1); e #### Using R command we can also compute the summation

b = x1 - y1; b

c = x1*y1; c

d = x1/y1; d

#### Sequence of number generation

#### There are two ways of number generation.

##### The first method can be applied if the common difference

##### between the numbers are only 1 and that of

###### second method is applicable for all possible of common difference.

x1 = 1:10; x1 ### 1st 10 natural numbers will generate here

x2 = seq(from = 1, to = 10, by = 3) ### Note that the common difference between the numbers be 3

#### Length wise number generation

x3 = seq(1, 50, length.out = 30) #### Some of the appeared numbers should be fraction

###### Use of some inbuilt functions

x = -10

abs(x) ## Returns the modulus i.e. the magnitude of the number stored under this variable

y = 1.02365

round(y)

round(y, digits = 2) #### Returns a number where 2 digits will appear after the decimal point

z = log(1); z #### Verify the value of log 1, whether it is zero or not

p = log(exp(1)); p

q = log(2, base = 10); q #### Here the base value is different

log10(2)

cat("The above two values are same")

numeric(10) ### Returns the value zero

rep(2, 30) ### Repeat the number 2, 30 times

rep(NA, 20) ### Repeat the logical variable NA 20 times

m = seq(1, 50, by = 1)

head(m) ##### Printing first six outputs

head(m, n = 7L) ##### Printing the first 7 outputs

tail(m) ##### Printing last six outputs

### Revisiting the basic operations

x=c(5, 7.3, 8, 9.2, 6.3, 4, 10)

sum(x)

min(x)

max(x)

cumsum(x)

cumprod(x)

which.max(x)

which.min(x)

mean(x)

sd(x)

x>5

sum(x>5) #number of numbers which are greater then 5

rev(x)

length(x)

sort(x, decreasing = TRUE ); help(sort)

x[x>5] #gives the numbers which are greater then 5

sum(x[x>5])

all(x>0)

any(x>0)

any(x<0)

### Finding the position of number

m = c(1, 1, 5, 6, 9.3, 86, 20, 6) ### First form a vector

length(m)

which(m == 5) ###### This command is used to find the position of any number

length(which(m == 1)) #### Return the value of total number positions where the number is located

###### Identification of the type of value stored

## There are 4 types of variables available in the R software

## 1. Integer

## 2. Numeric

## 3. Character

## 4. Logical

class(3)

class(3L)

class("3.2")

class(NA)

#### Conversion of decimal number into integer

as.integer(5.9) #### Returns the value of only integer part

Lecture 2 ( 4th September, 2019)

Matrix formation:

We all know that Matrix is nothing but the array of vectors. In order to analyse any kind of data sets we have to convert it in a matrix format or sometimes we collect the data time by time. Then , we have to club together to form a matrix. So matrix always play a big role in case of statistical analysis of any data sets. In the following i will give you a brief demonstration about the matrix formation in R software.

####### Matrix formation using R software

#### In R software matrix can be formed in 4 ways :

## 1. Binding process --- (i) Row bind; (ii) Column bind

## 2. Making a data frame ----- Use the command data.matrix to convert it into matrix

## ******** 3. Using Matrix command

## 4. Using diag command

#### Part 1 -- Binding process

### First create a vector

student1 = c(75, 65, 85, 95)

student2 = c(50, 60, 65, 75)

student3 = c(50, 60, 65, 75)

### Now, we will execute the first binding process i.e. row binding

Marks.data.1 = rbind(student1, student2, student3); Marks.data.1 ### Output is the matrix having separate ##rownames

#### The following command provides seperate column names so that

#### it will be very easy to understand the data structure

colnames(Marks.data.1) = c("Mid Sem 1", "Sem 1", "Mid Sem 2", "Sem 2"); Marks.data.1

##### Now, we will execute the 2nd binding process

Marks.data.2 = cbind(student1, student2, student3); Marks.data.2

### The following command provides the seperate row names

rownames(Marks.data.2) = c("Mid Sem 1", "Sem 1", "Mid Sem 2", "Sem 2"); Marks.data.2

####### Now, coming to the second part of matrix formation

data = data.frame(student1, student2, student3); data

summary(data)

head(data, 2) ### printing the head values

complete.cases(data) ### printing the missing values

data = data.frame(student1, student2, student3); data

### Look at the output, beside giving the marks of the students, it will also provide a serial number

### This is the specialty of the data frame. That means if you have any discrepancy about any data

### You can easily find it.

### Now, we will convert the data frame into matrix

new.data = data.matrix(data); new.data

### Again we can insert the row names here

rownames(new.data) = c("Mid Sem 1", "Sem 1", "Mid Sem 2", "Sem 2"); new.data

#### Now, we can use the most important tool of R software in order to construct any matrix

M1 = matrix(c(student1, student2, student3), nrow = 4, byrow = F); M1

M2 = matrix(c(student1, student2, student3), nrow = 4, ncol = 3, byrow = T); M2

### Try to understand the difference between these two outputs

### Difference between the data frames and matrix

x = student1

A = Marks.data.1

list = list(x, A, data); list

list[[1]]

newlist=list(x, A, data, list); newlist

newlist[[4]]

x=c(1,2,3,NA,3,5,8,NA,10,2) #Treating Missing Values

is.na(x) ### Returns the logical variable

!is.na(x)

sum(is.na(x)) # Number of missing values in the column

sum(!is.na(x)) # Number of non-missing values in the column

## Finally we consider the following matrix

M = matrix(c(student1, student2, student3), nrow = 3, byrow = T); M

### Some basic operations on matrix

## Transpose of the matrix

M.transpose = t(M); M.transpose

### Extracting rows and columns from any matrix

### R software denotes the rows of any matrix can by M[i,] and that of column by M[,j]

First.student = M[1,]; First.student

Mid.sem.1 = M[,1]; Mid.sem.1

##### method of constructing the diagonal matrix

M.diagonal = diag(c(1, 2, 3, 4, 5)); M.diagonal

##### Addition, Subtraction, Multiplication, Determinant, Inverse computation

A = matrix(data = c(10, 122, 113, 515, 120, 100, 24, 25), nrow = 4, ncol = 4); A

B = matrix(data = seq(2, 9), by = 1, nrow = 4, ncol = 4); B

sum = A + B; sum

sub = A - B; sub

component.multilplication = A*B; component.multilplication

multiplication = A%*%B; multiplication

C = matrix(c(1, 2, 4, 6), nrow = 2, ncol = 2); C

determinant = det(C); determinant

inverse = solve(C); inverse

### Simultaneous system of equation solve

### Consider the following system of equations

### 2x + 3y = 5; 9x + 4y = 15

D = matrix(c(2, 3, 9, 4), nrow = 2, byrow = T ); D

b = matrix(c(5, 15), nrow = 2); b

### check whether solution exists or not

if(det(D) != 0)

print("there exist a unique solution")

D.inverse = solve(D); D.inverse

solution = D.inverse %*% b; solution

##### Rank of the matrix

### Several methods are available in mathematical literature to find the rank of the matrix

#### Here we consider a numerical procedure i.e. the QR decomposition methods

D = matrix(c(2, 3, 9, 4), nrow = 2, byrow = T ); D

rank = qr(D)$rank

### Otherwise you can also install the package "pracma" to find the rank of any matrix

install.packages("pracma")

library(pracma)

rank = Rank(D); rank

#### Some little bit linear algebra

#### Recall that matrix D

D = matrix(c(2, 3, 9, 4), nrow = 2, byrow = T); D

eigen.value = eigen(D)$values; eigen.value

eigen.vectors = eigen(D)$vectors; eigen.vectors

#table function

gender = factor(c(rep("female",91),rep("male",92)))

gender

table(gender)

gender = factor(gender, levels=c("male","female"))

table(gender)

gender = factor(gender, levels=c("Male","female"))

table(gender)

Assisgnment based on today's lecture:

1. Consider a 4 by 4 matrix and show that it can be written as a sum of symmetric and skew symmetric matrix

2. Consider another 4 by 4 symmetric matrix; show that the eigen values are all real.

3. Consider a skew symmetric square matrix of order 4 and show that eigen values are purely imaginary or 0

4. Consider any orthogonal matrix and show that the eigen vectors are of unit modulus.

Lecture 3 ( 18th September, 2019)

Graph Plotting:

The graph plot or any kind of graphical analysis with that figure is very much important in the research work irrespective of the subject. So i am going to deliver a lecture on the graph plotting in the R software. The following code will help to demonstrate this.

### Graphical approach in R software ####

### First, i will show some basic methods

### Create a vector

x = c(1, 5, 7, 9, 11, 13, 20, 25); x

plot(x)

### In the following three codes note the differences

plot(x, type = "l") ### Returns a line graph

plot(x, type = "p") ### Returns a graph having only point i.e. scatter plot

plot(x, type = "b") ### Returns a graph having both point & line

### Some basic properties of the graph

plot(x, type = "b", xlab = "week", ylab = "price") ### Adding the label name in the graph

plot(x, type = "b", xlab = "week", ylab = "price", main = "Price increament in 8 weeks") ### Mentioning the caption above the graph

plot(x, type = "b", xlab = "week", ylab = "price", sub = "Price increament in 8 weeks") ### Mentioning the below above the graph

plot(x, type = "b", xlab = expression(bold(Week)),

ylab = expression(bold(Price)), main = "Price increament in 8 weeks") ## Bolding the label name

plot(x, type = "b", xlab = expression(bold(Week)),

ylab = expression(bold(Price)), main = "Price increament in 8 weeks", cex.lab = 1.3) ## Increase the label size

plot(x, type = "b", xlab = expression(bold(Week)),

ylab = expression(bold(Price)), main = "Price increament in 8 weeks",

cex.lab = 1.3, col = "blue", lwd = 2) ## Increase the label size

plot(x, type = "b", xlab = expression(bold(Week)),

ylab = expression(bold(Price)), main = "Price increament in 8 weeks",

cex.lab = 1.3, col = "blue", lwd = 2, xlim = c(1, 7), ylim = c(0, 15)) ## Increase the label size

### Adding legend to the graph

legend("topleft", legend = ("Stock market price"), col = "blue", lwd = 2, bty = "n")

#### Exporting the graph #####

### 1. Look at the right hand side where the plot has appeared. Just above that graph "Export"

### command is present.

### 2. Click on that command and select your type i.e. either in pdf version or image version.

### 3. If it is image version then after clicking the image a new window will appeared and just save ### your image properly.

#### Adding more another graph in a single a graph

y = seq(1, 10); y

lines(y, col = "red", lwd = 2, type = "b")

plot(x, type = "b", xlab = expression(bold(Week)),

ylab = expression(bold(Price)), main = "Price increament in 8 weeks",

cex.lab = 1.3, col = "blue", lwd = 2, xlim = c(1, 7), ylim = c(0, 15)) ## Increase the label size

points(y, col = "red", lwd = 2)

##### Graphical presentation of x vs y ######

plot(x, y, type = "b", col = "green", lwd = 1.5) #### Look at the output carefully. Try to understand it

x = seq(1, 20); y = seq(21, 40)

plot(x, y, lwd = 3, col = "orange", type = "b") #### Try to explain the functional form this graph

##### Now, we can draw the functional form of the above graph without using the plot command

m = 1; c = 5

curve(expr = m*x+c, from = 0, to = 50, col = "red",

lwd = 3, xlab = expression(bold(x)), ylab = expression(bold(y)), cex.lab = 1.2)

curve(expr = x^2, from = 0, to = 50, col = "blue", lwd = 3, add = T)

legend("bottomright", legend = c("straight line", "parabola"), col = c("red", "blue"), lwd = 3, bty = "n")

text(x = 30, y = 27, "Straight line")

text(x = 9, y = 31, "Parabola")

#### Find the intersecting point

### In order to do this we have to equate both the expression i.e. x^2 = x + 5 implies x^2 - x - 5 = 0

#### We have to find the root of the above equation only

### There are several numerical methods available in but we use just only the following command

f = function(x) x^2-x-5

root = uniroot(f, interval = c(0, 10))$root; root

### Check step whether our result is true or not

abline(v = root, lwd = 2, col = "green")

############################ MATPLOT #########################################

m = matrix(seq(1, 100), nrow = 10); m

matplot(m, type = "l", lwd = 4, lty = 1:10)

################# Pie Charts #####################

students<-c("A","B","C","D", "E","F")

marks<-c(54,12,58,39,30,55)

marksheet<-data.frame(students,marks); marksheet

marksheet<-transform(marksheet,percentage=(marksheet$marks/60)*100); marksheet

#attach(marksheet)

pie(marks,labels=c("A(90%)","B(20%)","C(96%)", "D(65%)", "E(50%)","F(91.7%)"), main="Marksheets of students", col=c("royalblue4","orchid","chocolate4","olivedrab3","orange3", "purple1","violetred4"))

#detach(marksheet)

############# Barplot ##################

consumption<-matrix(c(6.93,4.70,0.95,0.56,0.52,5.45,4.84,0.42,0.19,0.15),

nrow=2,ncol=5,byrow=T)

rownames(consumption)<-c("Rural","Urban")

colnames(consumption)<-c("Rice","Wheat","Jowar","Bazra","Maize")

consumption

barplot(consumption,beside=T,main="Consumption of cereals (in kg./month)\n

during 1989-'90",legend.text=rownames(consumption))

##### See the difference between the barplot and histogram

hist(consumption)

### Try to understand the difference

### Barplot gives you the frequency distribution only

### but histogram provides you both the frequency diagram and probability distribution

#### Detail analysis of the histogram

x = rnorm(n = 1000, mean = 0.3, sd = 1.2); x

h1 = hist(x, probability = F) ### for the frequency diagram

par(mfrow = c(2,2))

h2 = hist(x, probability = T, col = "orange") ### for the distribution diagram

h2 = hist(x, probability = T, breaks = 25, col = "grey")

print(h2)

library(Rlab)

hplot(x, breaks = 50) #### no need to input the argument probability = T

####### Zoom the graph ###############

library(zoom)

plot(rnorm(100),rnorm(100))

zoomplot.zoom(fact=5,x=0,y=0)

Moreover, if you want to plot any 3d graph or any kind of ocean view like structure then go through the package list as mentioned in the "package" section of my webpage.

Assignment:

Plot some known functions like sin x, cos x, log x, e^x, pdf of normal distribution in R software by considering x as a vector and check whether it matches with the actual image and put the necessary legends there.
Construct a matrix where each row has a certain maximum value and plot them togetherly and see what happens.

Lecture 4 ( 20th September, 2019)

Regression method :

Regression is very much important thing in analyzing the data sets. It will provide a understanding of the data sets with the curve which e are going to fit. Moreover, with the help of the regression procedure we can guess the underlying model. There are several procedures are available regarding the regression technique. The following will give you the step by step procedure for handling any kind of data sets. The method which i opt is mentioned in the following an as per the following protocol the code is also arranged.

(i) The linear model fitting

(ii) The non linear model fitting

(iii) Model fitting with the help of the Maximize Likelihood technique (MLE)

(iv) Real data set handling.

######### Linear model fitting #######

###### First consider a dataset from the R software ########

dataset = cars ## dataset consideration

View(dataset) #### Excel like view

speed = dataset$speed ### extracting the speed data

distance = dataset$dist ### extracting the distance data

plot(speed, distance, xlab = expression(bold(Speed)),

ylab = expression(bold(Distance)), col = "red", lwd = 2, cex.lab = 1.5)

#### guess the associationship between the speed and the distance data

#### Now, we will move towards the fitting procedure

fit.linear = lm(distance~speed) ### Consider the relation distance = a*speed + b; b being the intercept

#### That means distance is the response variable and speed is the predictor.

fit.linear

summary(fit.linear) ### Summary will give you all the details of the estimation

intercept = as.numeric(fit.linear$coefficients[1]) ## if you do not include as.numeric then the word (intercept) will appear together with the value

intercept

slope = as.numeric(fit.linear$coefficients[2]); slope

lines(speed, fit.linear$fitted.values, col = "blue", lwd = 3)

legend("topleft", legend = c("estimated"), col = "blue", lwd = 3, bty = "n")

####### Non linear model fitting #####

##### "Puromycin" is also another data sets available in the R software. So we consider this data sets.

attach(Puromycin)

plot(Puromycin$conc,Puromycin$rate, xlab = expression(bold(Concentartaion)),

ylab = expression(bold(Rate)), cex.lab = 1.5)

data = Puromycin[order(conc),]

fit.linear.new = lm(rate~conc, data = data)

lines(data$conc, fit.linear.new$fitted.values, col = "blue", lwd = 2)

func=function(conc,a,k) a*conc/(k+conc)

fit.nonlinear=nls(rate ~ func(conc,a,k), data=data, start=c(a = 50, k = 0.05))

lines(data$conc,predict(fit.nonlinear), col = "red", lwd = 2)

AIC(fit.linear.new)

AIC(fit.nonlinear)

summary(fit.nonlinear)

########### Model fitting with maximum likelihood technique ###########

x = rnorm(10000000, 5, 2); x ## Extracting the 10000000 samples from the normal distribution

n = length(x); n

hist(x, probability = T, col = "grey", breaks = 50)

#### Be careful about the likelihood function. You have to write the negative of the log-likelihood

likelihood = function(mu, sigma)

{

n*log(sigma)+1/(2*sigma^2)*sum((x-mu)^2)

}

library(bbmle) ##### The package for using the Likelihood methods is "bbmle"

#library(stats4)

m = mle2(likelihood, start = list(sigma= 1, mu = 5))

summary(m)

mu = as.numeric(coef(m)[1]); mu

sigma = as.numeric(coef(m)[2]); sigma

y = rnorm(n, mu, sigma)

lines(density(x), col = "red", lwd = 3, lty = 2)

lines(density(y), col = "blue", lwd = 3, lty = 3)

################ Real data set handling #################

##### You can get the mock data set from the link (Data) .

##### Otherwise you can mail me.

setwd("D:/Ph.d/Teaching/MSQE_2019") ### setting the working directory

dataset = read.csv("Train_Tickets.csv") ### importing the datasets from the excel file

library(utils)

View(dataset) #### excel like view

count = dataset[,2] ### extracting all data

l = length(count); l

hist(count, probability = T, breaks = 25, col = "grey", main = "Assignment")

##### Distribution fitting with the help of "fitdistr" command

library(MASS)

y = fitdistr(count, "exponential") #### distributional fitting

lambda = as.numeric(y$estimate); lambda

curve(lambda*exp(-lambda*x), add = T, col = "red", lwd = 3, lty = 1)

##### fitting with MLE ####

##### Exponential distribution fitting ####

likelihood = function(lambda)

{

-l*log(lambda) + lambda* sum(count)

}

library(bbmle)

#library(stats4)

m = mle2(likelihood, start = list(lambda = 0.007))

summary(m)

lambda.MLE = as.numeric(coef(m)); lambda.MLE

curve(lambda.MLE*exp(-lambda.MLE*x), add = T, col = "blue", lwd = 3, lty = 3)

legend("topright", legend = c("fitdistr command", "MLE"),

col = c("red", "blue"), lwd = 3, lty = c(1, 3), bty = "n")

Some time it may happen that the data sets are available in grouped format. In order to perform the regression techniques for those kind of data sets use the package "grouped". The command "group" under this package will be the key one in solving this issue.

Page updated

Google Sites

Report abuse