Workshop in the Deparment of Anthropology, Dibrugarh University 

(29th February-5th March, 2020)

The workshop is organized by the Department of Anthropology, Dibrugarh University in collaboration with the Indian Statistical Institute, Kolkata. The main motto of this workshop is to refresh/update the knowledge of statistics in collection with the empirical data. Now a days the analysis of this kind of data sets require a good prior knowledge in any software. The most popular statistical software in this domain is the R software. The  main reason behind the popularity of this software is that it is free of cost and you can easily download this software from the provided link (R software). Apart from this another reason is the versatility i.e. this software is very much useful in every domain.

You can easily avail the schedule of this workshop named by "Emerging role of Statistics in Anthropology with application in R and SPSS" (dated 28th February to 5th March, 2020) from the link Schedule. The study material of the workshop will be uploaded soon. The data set which is to be used during this workshop is (i) Anthropometric Data ;(ii)  Logistic Regression Data; (iii) Paired Data. (iv) Tea worker data. Download this data set and keep it in your machine. 

 Introduction with R software:

It was begun in the 1990s by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland. Nearly 20 senior statisticians provide the core development group of the R language, including the primary developer of the original §language, John Chambers, of Bell Labs. It contains a large, coherent, integrated collection of intermediate tools for data analysis. The software is FREE and can be downloaded from  http://www.r-project.org/  or you can download from . The versatility of this software is that it can be used in every domain and the coding structure in this software is quite easy rather than the others. Moreover, R software is used not only for coding purpose, you can also prepare any manuscript or prepare any sorts of presentation in this software. R markdown will help in this case. The major yardstick of the software is the R packages. Basically, Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. Currently, the CRAN package repository features 10964 available packages.https://cran.r-project.org/web/packages/. To install packages type install.packages(pkgs, dependencies=TRUE) in the console.

In most of my research work i have used this software. Moreover, for any sorts of discrepancy do not hesitate to contact me. The contact details is provided in my home page. Don't worry this page will be update as per your needs. 

References:

Day 1 :

1. Basic Understanding of the sampling distribution with the help of R software:

set.seed(897)


############# Data


############## The unit of the height is in cm #################


Height = rnorm(1000, 165, 10) 


Height = round(Height, 0)


Height


#######graphical representation


plot(Height, xlab="students no.", ylab="Height")


######frequency distribution


par(mfrow=c(2, 2))


hist(Height)


#prob=T)

#lines(density(Height, kernal="gaussian"))


##########sample of size 100


sample100=sample(Height, 100)


sample100


######sample frequency distribution


hist(sample100, col="red", main = "Height distribution of 100 students")


##########sample of size 500


sample500=sample(Height, 500)


sample500


######sample frequency distribution


hist(sample500, col="blue", main = "Height distribution of 500 students")


##########sample of size 100 100 times


samplecomb=matrix(NA, 100, 100)


for (i in 1:100)


{


samplecomb[i, ]=sample(Height, 100)


}


sample.comb.row.mean=rowMeans(samplecomb[, 1:100])


hist(sample.comb.row.mean, col="green")


############# Data


set.seed(897)


Height.group.1 = rnorm(1000, 165, 10)



Height.group.1 = round(Height.group.1, 0)


############# Data


set.seed(897)


Height.group.2 = rnorm(1000, 171, 10) + rnorm(1000, 0, 2)


Height.group.2 = round(Height.group.2, 0)


#######graphical representation


plot(Height.group.1, Height.group.2)


Height.group.data = data.frame(Height.group.1, Height.group.2)


cor.height.group.1.2=cor(Height.group.1, Height.group.2); cor.height.group.1.2


fit.linear=lm(Height.group.2~Height.group.1, Height.group.data)


abline(fit.linear, col = "red")

3. Matrix in R software:

The term "Matrix" is very common in almost every domain of science. As most of the scientific research are dealt with the data set. Frankly speaking matrix is nothing but an array of vectors or in a simple way the structure "matrix" gives you a systematic presentation of the data. The study material for the matrix formation is Matrix_Study_Material.  Since we are dealing with the data set so it is very important to us to learn about how to write code in R to construct a matrix. The following will demonstrate this issue.

####### Matrix formation using R software

#### In R software matrix can be formed in 4 ways :

## 1. Binding process --- (i) Row bind; (ii) Column bind

## 2. Making a data frame ----- Use the command data.matrix to convert it into matrix

## ******** 3. Using Matrix command

## 4. Using diag command

#### Part 1 -- Binding process

### First create a vector


group1 = c(175, 165, 185, 195)

group2 = c(150, 160, 165, 175)

group3 = c(150, 160, 165, 175)


### Now, we will execute the first binding process i.e. row binding


Height.data.1 = rbind(group1, group2, group3); Height.data.1 ### Output is the matrix having separate ##rownames


#### The following command provides seperate column names so that

#### it will be very easy to understand the data structure


colnames(Height.data.1) = c("individual 1", "individual 2", "individual 3", "individual 4"); Height.data.1


##### Now, we will execute the 2nd binding process


Height.data.2 = cbind(group1, group2, group3); Height.data.2


### The following command provides the seperate row names


rownames(Height.data.2) = c(" individual 1", "individual 2", "individual 3", "individual 4"); Height.data.2


####### Now, coming to the second part of matrix formation


data = data.frame(group1, group2, group3); data


summary(data)


head(data, 2) ### printing the head values


complete.cases(data) ### printing the missing values


data = data.frame(group1, group2, group3); data


### Look at the output, beside giving the Height of the groups, it will also provide a serial number

### This is the specialty of the data frame. That means if you have any discrepancy about any data

### You can easily find it.


### Now, we will convert the data frame into matrix


new.data = data.matrix(data); new.data


### Again we can insert the row names here


rownames(new.data) = c(" individual 1", "individual 2", "individual 3", "individual 4"); new.data


#### Now, we can use the most important tool of R software in order to construct any matrix


M1 = matrix(c(group1, group2, group3), nrow = 4, byrow = F); M1


M2 = matrix(c(group1, group2, group3), nrow = 4, ncol = 3, byrow = T); M2 



## Finally we consider the following matrix


M = matrix(c(group1, group2, group3), nrow = 3, byrow = T); M


### Some basic operations on matrix

## Transpose of the matrix


M.transpose = t(M); M.transpose


### Extracting rows and columns from any matrix

### R software denotes the rows of any matrix can by M[i,] and that of column by M[,j]


First.group = M[1,]; First.group


Mid.sem.1 = M[,1]; Mid.sem.1


##### method of constructing the diagonal matrix


M.diagonal = diag(c(1, 2, 3, 4, 5)); M.diagonal


##### Addition, Subtraction, Multiplication, Determinant, Inverse computation


A = matrix(data = c(10, 122, 113, 515, 120, 100, 24, 25), nrow = 4, ncol = 4); A


B = matrix(data = seq(2, 9), by = 1, nrow = 4, ncol = 4); B


sum = A + B; sum


sub = A - B; sub


component.multilplication = A*B; component.multilplication


multiplication = A%*%B; multiplication

4. Graph Plotting:

The graph plot or any kind of graphical analysis with that figure is very much important in the research work irrespective of the subject. In order to get the flavor of the different color you should look at the color page of the R software whose link is http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf . So i am going to deliver a lecture on the graph plotting in the R software.  The following code will help to demonstrate this.


### Graphical approach in R software ####


### First, i will show some basic methods


### Create a vector


x = c(160, 162, 163, 165, 168, 170); x


plot(x)


### In the following three codes note the differences


plot(x, type = "l") ### Returns a line graph


plot(x, type = "p") ### Returns a graph having only point i.e. scatter plot


plot(x, type = "b") ### Returns a graph having both point & line


### Some basic properties of the graph


plot(x, type = "b", xlab = "Number of Person", ylab = "Height") ### Adding the label name in the graph


plot(x, type = "b", xlab = "Number of Person", ylab = "Height", main = "Height of six individuals") ### Mentioning the caption above the graph


plot(x, type = "b", xlab = "Number of Person", ylab = "Height", sub = "Height of six individuals") ### Mentioning the below above the graph


plot(x, type = "b", xlab = expression(bold(Number)),

     ylab = expression(bold(Height)), main = "Height of six individuals") ## Bolding the label name


plot(x, type = "b", xlab = expression(bold(Number)),

     ylab = expression(bold(Height)), main = "Height of six individuals", cex.lab = 1.3) ## Increase the label size



plot(x, type = "b", xlab = expression(bold(Number)),

     ylab = expression(bold(Height)), main = "Height of six individuals",

     cex.lab = 1.3, col = "blue", lwd = 2) ### colour the graph


plot(x, type = "b", xlab = expression(bold(Number)),

     ylab = expression(bold(Height)), main = "Height of six individuals",

     cex.lab = 1.3, col = "blue", lwd = 2, ylim = c(160, 170), xlim = c(1, 8)) ## Increase the label size




### Adding legend to the graph


legend("topleft", legend = ("Height of 1st group"), col = "blue", lwd = 2, bty = "n")


#### Exporting the graph #####


### 1. Look at the right hand side where the plot has appeared. Just above that graph "Export"

### command is present.


### 2. Click on that command and select your type i.e. either in pdf version or image version.


### 3. If it is image version then after clicking the image a new window will appeared and just save ### your image properly.


#### Adding more another graph in a single a graph


y = seq(160, 165, length.out = 6); y


lines(y, col = "red", lwd = 2, type = "b")


#### Inserting another graph through scatter plot


plot(x, type = "b", xlab = expression(bold(Number)),

     ylab = expression(bold(Height)), main = "Height of individuals",

     cex.lab = 1.3, col = "blue", lwd = 2, ylim = c(160, 170), xlim = c(1, 8)) ## Increase the label size


points(y, col = "red", lwd = 2)


##### Graphical presentation of x vs y ######


plot(x, y, type = "b", col = "green", lwd = 1.5) #### Look at the output carefully. Try to understand it


x = seq(1, 20); y = seq(21, 40)


plot(x, y, lwd = 3, col = "orange", type = "b") #### Try to explain the functional form this graph


##### Now, we can draw the functional form of the above graph without using the plot command


m = 1; c = 5


curve(expr = m*x+c, from = 0, to = 50, col = "red",

      lwd = 3, xlab = expression(bold(x)), ylab = expression(bold(y)), cex.lab = 1.2)


curve(expr = x^2, from = 0, to = 50, col = "blue", lwd = 3, add = T)


legend("bottomright", legend = c("straight line", "parabola"), col = c("red", "blue"), lwd = 3, bty = "n")


text(x = 30, y = 27, "Straight line")


text(x = 9, y = 31, "Parabola")



############################ MATPLOT #########################################


m = matrix(seq(1, 100), nrow = 10); m


matplot(m, type = "l", lwd = 4, lty = 1:10)


################# Pie Charts #####################


students<-c("A","B","C","D", "E","F")

marks<-c(54,12,58,39,30,55)

marksheet<-data.frame(students,marks); marksheet

marksheet<-transform(marksheet,percentage=(marksheet$marks/60)*100); marksheet

attach(marksheet)

pie(marks,labels=c("A(90%)","B(20%)","C(96%)", "D(65%)", "E(50%)","F(91.7%)"), main="Marksheets of students", col=c("royalblue4","orchid","chocolate4","olivedrab3","orange3", "purple1","violetred4"))

#detach(marksheet)


############# Barplot ##################


consumption<-matrix(c(6.93,4.70,0.95,0.56,0.52,5.45,4.84,0.42,0.19,0.15),

                    nrow=2,ncol=5,byrow=T)

rownames(consumption)<-c("Rural","Urban")

colnames(consumption)<-c("Rice","Wheat","Jowar","Bazra","Maize")

consumption

barplot(consumption,beside=T,main="Consumption of cereals (in kg./month)\n

during 1989-'90",legend.text=rownames(consumption))



##### See the difference between the barplot and histogram


hist(consumption)


### Try to understand the difference

### Barplot gives you the frequency distribution only

### but histogram provides you both the frequency diagram and probability distribution


#### Detail analysis of the histogram


x = rnorm(n = 1000, mean = 0.3, sd = 1.2); x


h1 = hist(x, probability = F) ### for the frequency diagram


par(mfrow = c(2,2))


h2 = hist(x, probability = T, col = "orange") ### for the distribution diagram


h2 = hist(x, probability = T, breaks = 25, col = "grey")


print(h2)

Day 3:

So far what we have learned is the basic understanding of the R software with some simple code. These simple codes help us to familiarize with the R environment. Now today's all lectures (both given by Prof. Sabyasachi Bhattacharya and Mr. Ayan Paul) is totally devoted to the statistical analysis of the Anthropometric data. 

Now coming to the statistical analysis portion, it is very important to import data from any other software like SPSS, Stata, AMOS, Excel, Notepad etc. to the R environment. The link for to understand the fundamental code to import data from all of these aforementioned softwares to R is https://www.statmethods.net/input/exportingdata.html.  

5. Regression set up:

The statistical regression will explain the strength of association between the dependent and the independent variable.  The dependent variable is treated as the response variable and the independent variable is treated as the predictor variable i.e. basd on which we can predict our model. The following code will demonstrate this issue

setwd("D:/Ph.d/Kalyani_Economics") ### setting the working directory

data = read.csv("tea.csv") 

View(data) ### Matrix like view

ncol(data) 

#### Category wise division


data$Category[data$bmi < 18.5] = 1 #### suspecting tb

data$Category[data$bmi >=18.5] = 0 ###  no tb

View(data)

colnames(data)[20] = "bmi_cate"

View(data)

str(data)

data$bmi_cate = as.factor(data$bmi_cate)

str(data) #### Look at the last part


############# linear model ############


Data = data.frame(ht, wt); Data


#### The command for the linear model fitting is "lm"


fit.linear = lm(wt~ht)


summary(fit.linear) #### give the necessary information


R.squared = summary(fit.linear)$r.squared


plot(ht, wt, xlab = expression(bold(Height)), ylab = expression(bold(Weight))) 


lines(ht, predict(fit.linear), col = "blue", lwd = 2)


###### Non linear regression 


#### First we have to extract the data


fev7 = data$fev7

age = data$age


data1 = data.frame(fev7, age)

##### Sorting the data first #####

attach(data1)

new.data1 = data1[order(fev7),]

fit.nls = nls(age~b*exp(-a*fev7), data = new.data1, start = list(a = 0.5, b = 0.2))

summary(fit.nls)

a = as.numeric(coef(fit.nls)[1]); a

b = as.numeric(coef(fit.nls)[2]); b


plot(new.data$fev7, new.data$age, xlab = "Fev 7", ylab = "Age")

lines(new.data$fev7, predict(fit.nls), col = "red", lwd = 2)

6. Mean test:

There are several procedure to perform the mean test. The most popular method is the One sample t test, Paired t test, Two sample t test. All of these methods are to be used only for the two groups but when there exist more than two groups then these kind of methods fail to explain the mean difference between these groups. Then we should apply Analysis of Covariance (ANOVA) in order to test the equality of mean of more than on group. 

ANOVA:

Suppose we have n groups and we are going to test the equality of the mean of these groups. Then the hypothesis problem can be posed as H0:(m.1 = m.2 = ... = m.n) against H1: not H0. This process is termed as the ANOVA. The basic assumption of this method is that the underlying data follows the normal distribution and the variance between the groups are same i.e. they follow homoscedasticity rule. But in case of real data we should be aware of these two assumptions. Rather we should use any proper test like "Shapiro Wilcox test", "QQplot", "Histogram" etc. to judge whether our data follows normality or not. Again for the variance part we should be aware of the test "Bartlett test". If those tests provide a positive result then move forward other wise you can also perform "Box-Cox transformation" to make the data set normal. If this property does not hold then leave that data set. 


Logistic Regression:

The above discussed regression method is performed with the help of Least square theory. As both of the explanatory and the response variable are in continuous set up. Sometimes it may happen that the response variable is binary i.e. it has only two value i.e. "1" and "0". At that time we are unable to perform the least square theory. Then we apply the logistic regression. 

### One sample t test

one.sample.test = t.test(x = data$bmi, mu = 17, alternative = "two.sided")

if(one.sample.test$p.value < 0.05)

  print("The alternative hypotheis is accepted")


##### Paired t test 


setwd("C:/Users/user/Dropbox/ASSAM")

paired.data = read.csv("paired_data.csv")

View(paired.data)


x = paired.data$Before.Diet

y = paired.data$After.Diet

t.test(x, y, paired = T)


########## Two sample t test ##########


t.test(bmi ~ bmi_cate)


###### logistic regression


fvc = data[,12]; fvc

#fev1 = data[,13]; fev1

logit.model <- glm(bmi_cate ~ wt + fvc, data = data, family = 'binomial')

summary(logit.model)


##### ANOVA 

max(data$age) 

min(data$age)

hist(age, breaks = 20)


data$Category[data$age >= 20 & data$age <24] = "A" 

data$Category[data$age >= 24 & data$age <34] = "B" 

data$Category[data$age >= 34 & data$age <40] = "C" 

data$Category[data$age >= 40 & data$age <46] = "D" 

data$Category[data$age >= 46 & data$age <52] = "E" 

data$Category[data$age >= 52] = "F" 

View(data)

ncol(data)

colnames(data)[21] = "age_cate"

str(data)

B = bartlett.test(bmi~age_cate, data = data)


if(B$p.value>0.05)

  print("Variance between the groups are equal")


attach(data)

tapply(bmi, age_cate, mean)

p= aov(bmi ~ age_cate, data=data)

summary(p)


### post hoc test is performed to find the significant difference 

TukeyHSD(p)

7. Non parametric test:

Sometime it may happen that the data fails to show its normality character. Then no need to worry because there also exist several non parametric test to pursue your data set analysis. The following code will demonstrate this issue. 

setwd("C:/Users/USER/Dropbox/ASSAM") ## setting the working directory

trial.data = read.csv("example_data.csv")

View(trial.data)


bmi.1 = trial.data$bmi

weight.1 = trial.data$wt

height.1 = trial.data$ht


shapiro.test(bmi.1)


shapiro.test(weight.1)



### Since our data set does not follow the normality so we should

### follow any non parametric regression

### smooth.spline


fit.spline = smooth.spline(height.1, weight.1, spar = 0.7)

plot(height.1, weight.1, xlab = expression(bold(Height)), ylab = expression(bold(Weight)))

lines(fit.spline, col = "red", lwd = 2)

summary(fit.spline)

8. Chernoff's face: 

A novel method of representing multivariate data is presented. Each point in k-dimensional space is represented by a cartoon of a face whose features, such as length of nose and curvature of mouth, correspond to components of the point. Thus every multivariate observation is visualized as a computer-drawn face. This presentation makes it easy for the human mind to grasp many of the essential regularities and irregularities present in the data. The basic thing is that the author considered several facial aspect to express the multivariate data. We will let you know about the use of the Chernoff face in case of the anthropometric data with the help of R software. 

setwd("C:/Users/USER/Dropbox/ASSAM") ## setting the working directory

trial.data = read.csv("example_data.csv")

str(trial.data)

View(trial.data)


##### install the following package to run the Chernoff's face

install.packages("aplpack")

library(aplpack)

faces()

faces(trial.data[1:20, c(1,5:6)])