DST SERB High-End Workshop on application of statistics and machine learning in environmental research

(4th March-10th March, 2024)

The workshop was organized by the Indian Statistical Institute, Tezpur campus. Although the course ran for a week, Prof. Sabyasachi Bhattacharya and I were engaged for only two days of lectures. The main aim of the workshop was to refresh and update participants' knowledge of statistics in connection with ideas from Machine Learning. Nowadays, the analysis of any data set requires good prior knowledge of some statistical software, and the most popular one in this domain is R. The main reason behind its popularity is that it is free of cost and can easily be downloaded from the provided link (R software). Another reason is its versatility: the software is useful in virtually every domain.

Introduction to the R software:

R was begun in the 1990s by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland. Nearly 20 senior statisticians form the core development group of the R language, including John Chambers of Bell Labs, the primary developer of the original S language. R contains a large, coherent, integrated collection of intermediate tools for data analysis. The software is free and can be downloaded from http://www.r-project.org/. Its versatility means it can be used in every domain, and its coding structure is simpler than that of many comparable packages. Moreover, R is not used only for coding; you can also prepare manuscripts and all sorts of presentations in it, with R Markdown helping in this case. A major strength of the software is its packages. Packages are collections of R functions, data, and compiled code in a well-defined format, and the directory where packages are stored is called the library. Currently, the CRAN package repository features 10964 available packages (https://cran.r-project.org/web/packages/). To install packages, type install.packages(pkgs, dependencies=TRUE) in the console.
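For example, to install and load a CRAN package (ggplot2 is used here purely as an illustration; any package name works in its place):

install.packages("ggplot2", dependencies = TRUE)   # one-time download from CRAN
library(ggplot2)                                   # load the package in each new session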

I have used this software in most of my research work. For any sort of query, do not hesitate to contact me; the contact details are provided on my home page. Don't worry, this page will be updated as per your needs.


8th March 2024:

1. Basic Understanding of the sampling distribution with the help of R software:

set.seed(897)

############# Data
Marks = rnorm(1000, 65, 10)   # marks of 1000 students: population mean 65, sd 10
Marks = round(Marks, 2)
Marks

####### graphical representation
plot(Marks, xlab = "students no.", ylab = "Marks")

###### frequency distribution
par(mfrow = c(2, 2))
hist(Marks)
#v = seq(min(Marks), max(Marks), length.out = 25)
#hist(Marks, breaks = v, prob = T)
#lines(density(Marks, kernel = "gaussian"))


########## sample of size 100
sample100 = sample(Marks, 100)
sample100

###### sample frequency distribution
hist(sample100, col = "red")

########## sample of size 500
sample500 = sample(Marks, 500)
sample500

###### sample frequency distribution
hist(sample500, col = "blue")

## Mean analysis
mean(Marks)       # population mean
mean(sample100)   # sample mean, n = 100
mean(sample500)   # sample mean, n = 500


########## 100 samples, each of size 100
samplecomb = matrix(NA, 100, 100)
for (i in 1:100)
{
  samplecomb[i, ] = sample(Marks, 100)
}
dim(samplecomb)
samplecombrwmean = rowMeans(samplecomb)   # mean of each repeated sample

hist(samplecombrwmean, col = "green", main = "",
     xlab = "Class boundary based on repeated sample mean")

mean(Marks)
mean(sample100)
mean(sample500)
mean(samplecombrwmean)


########## More and more samples
set.seed(989)

sample10.mean = numeric(5)
for(i in 1:5)
{
  sample10 = sample(Marks, 10)
  sample10.mean[i] = mean(sample10)
}
sample10.mean
#hist(sample10.mean)
mean(sample10.mean)

par(mfrow = c(2, 2))
plot(sample10.mean)
abline(h = 65, col = "blue", lty = 2)   # reference line at the population mean
var(sample10.mean)

#############
sample25.mean = numeric(5)
for(i in 1:5)
{
  sample25 = sample(Marks, 25)
  sample25.mean[i] = mean(sample25)
}
sample25.mean
plot(sample25.mean)
abline(h = 65, col = "blue", lty = 2)
var(sample25.mean)

#########
sample50.mean = numeric(5)
for(i in 1:5)
{
  sample50 = sample(Marks, 50)
  sample50.mean[i] = mean(sample50)
}
sample50.mean
plot(sample50.mean)
abline(h = 65, col = "blue", lty = 2)
var(sample50.mean)

########################
sample75.mean = numeric(5)
for(i in 1:5)
{
  sample75 = sample(Marks, 75)
  sample75.mean[i] = mean(sample75)
}
sample75.mean
plot(sample75.mean)
abline(h = 65, col = "blue", lty = 2)
var(sample75.mean)
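The four loops above differ only in the sample size. A small helper function makes the pattern explicit and lets you compare the variances of the sample means directly (a sketch; rep.sample.mean is a name introduced here for illustration, not part of the workshop material):

rep.sample.mean = function(n, reps = 5)
{
  means = numeric(reps)
  for(i in 1:reps) means[i] = mean(sample(Marks, n))
  means
}
sapply(c(10, 25, 50, 75), function(n) var(rep.sample.mean(n)))
# the variance of the sample mean shrinks as the sample size grows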

2. Methodology to deal with small sample size

set.seed(897)

############# Data
Marks = rnorm(1000, 65, 10)
Marks = round(Marks, 0)
Marks

##### Sampling scheme without replacement
l = 1000
sample10.mean = numeric(l)
for(i in 1:l)
{
  sample10 = sample(Marks, 10)   # replace = FALSE is the default
  sample10.mean[i] = mean(sample10)
}
sample10.mean
hist(sample10.mean)
mean(sample10.mean)


##### Sampling scheme with replacement
sample10.mean.wr = numeric(l)
for(i in 1:l)
{
  sample10.wr = sample(Marks, 10, replace = T)
  sample10.mean.wr[i] = mean(sample10.wr)
}
sample10.mean.wr
hist(sample10.mean.wr)

mean(sample10.mean.wr)
mean(sample10.mean)


#### Bootstrap
x = c(1, 7, 2, 3, 4, 9, 11)   # sample vector
B = 1000                      # number of bootstrap samples to be generated
out = matrix(NA, nrow = B, ncol = length(x))
b_mean = numeric(B)
b_sd = numeric(B)
for(i in 1:B) {
  out[i, ] = sample(x, size = length(x), replace = TRUE)   # resample with replacement
  b_mean[i] = mean(out[i, ])
  b_sd[i] = sd(out[i, ])
}

hist(b_mean)

mean(b_mean)   # bootstrap mean of the sample mean
mean(x)

sd(b_mean)     # bootstrap standard error

b.ci.mean = quantile(b_mean, c(2.5, 97.5)/100)   # 95% percentile confidence interval
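For comparison, the same percentile interval can be obtained with the boot package, which ships with standard R distributions (a sketch under that assumption; boot.mean is a helper name introduced here):

library(boot)
boot.mean = function(d, idx) mean(d[idx])   # statistic computed on each resample
b = boot(data = x, statistic = boot.mean, R = 1000)
b                                           # bootstrap estimate and standard error
boot.ci(b, type = "perc")                   # 95% percentile interval, comparable to b.ci.mean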

3. Data structure in R software:

data = read.csv(file.choose())
View(data)
sbp = data$sbp   # systolic blood pressure
sm = summary(sbp)
sm

b1 = boxplot(sbp, horizontal = T)
b1$out   # observations flagged as outliers

dbp = data$dbp   # diastolic blood pressure
data.md = data.frame(sbp, dbp)
View(data.md)
boxplot(data.md)

mean(sbp)
var(sbp)
sd(sbp)


par(mfrow = c(2, 2))

pos = which(sbp > 100)
sbp.gt.100 = sbp[pos]
h2 = hist(sbp.gt.100, probability = T)
length(sbp.gt.100)
h2$breaks

break.md = c(100, 130, 150, 180, 200, 240)   # user-defined class boundaries
h3 = hist(sbp.gt.100, breaks = break.md, probability = F)

h4 = hist(sbp.gt.100, probability = F)

pos1 = which(sbp <= 100)
sbp.less.equal.100 = sbp[pos1]
h1 = hist(sbp.less.equal.100, probability = T)
length(sbp.less.equal.100)
h1$breaks


sbp
dbp

plot(sbp, dbp)
r = cor(sbp, dbp)
if(r > 0.5)
  print("The correlation is strongly positive")

l = length(sbp)

x = numeric(l - 1)
x
length(x)
for(i in 1:length(x))
  x[i] = log(sbp[i+1]) - log(sbp[i])   # log-differences of successive sbp values
x
l

#cor(x, dbp)   # would throw an error: x is one element shorter than dbp
cor(x, dbp[-1])
cor(sbp, dbp)


par(mfrow = c(2, 2))

plot(x, dbp[-1])
plot(sbp, dbp)
plot(sbp, dbp, col = "red")
plot(sbp, dbp, col = "red", lwd = 2)

plot(sbp, dbp, col = "red", lwd = 2,
     xlab = "Systolic Blood Pressure", ylab = "Diastolic Blood Pressure")

plot(sbp, dbp, col = "red", lwd = 2,
     xlab = "Systolic Blood Pressure",
     ylab = "Diastolic Blood Pressure", main = "Information on Blood Pressure")

plot(sbp, dbp, col = "red", lwd = 2,
     xlab = "Systolic Blood Pressure",
     ylab = "Diastolic Blood Pressure",
     main = "Information on Blood Pressure", cex.lab = 1.3)
axis(1, font = 2)   ## x axis
axis(2, font = 2)   ## y axis

plot(sbp, dbp, col = "red", lwd = 2,
     xlab = "Systolic Blood Pressure",
     ylab = "Diastolic Blood Pressure",
     main = "Information on Blood Pressure",
     cex.lab = 1.3, cex = 2)

plot(sbp, dbp, col = "red", lwd = 2,
     xlab = "Systolic Blood Pressure",
     ylab = "Diastolic Blood Pressure",
     main = "Information on Blood Pressure",
     cex.lab = 1.3, cex = 2, pch = 19)


m = sbp^2
fit.linear = lm(dbp ~ sbp)   # simple linear regression of dbp on sbp
summary(fit.linear)

lines(sbp, predict(fit.linear), lwd = 5)
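Since lines() joins the fitted values in the order the observations appear, the overlaid line can zigzag when sbp is unsorted; drawing the fitted line with abline() avoids this (a minor alternative, not part of the original script):

abline(fit.linear, lwd = 5)   # fitted regression line, independent of data ordering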

a.hat = as.numeric(coef(fit.linear)[1])   # estimated intercept
b.hat = as.numeric(coef(fit.linear)[2])   # estimated slope

print(a.hat)
print(b.hat)

resd = residuals(fit.linear)
plot(resd)
abline(h = 0, col = "blue", lwd = 3)   # residuals should scatter randomly around zero


9th March 2024:

## Access the data through the file "anova_data_final"
## You are not allowed to provide the path manually
data = read.csv(file.choose())
View(data)
stars(data)

## Extract the three groups' data
group1 = data[, 1]
group2 = data[, 2]
group3 = data[, 3]

## Draw the histogram for each of the groups and display them on the same graphics page
par(mfrow = c(2, 2))
hist(group1)
hist(group2)
hist(group3)


### Generate another dataset with the name "group4" by adding to the group1 data
### a random sample from the normal distribution with mean 48 and standard deviation 1
### Extract the group4 data
## Draw the histogram of the group4 data

group4 = group1 + rnorm(length(group1), 48, 1)
hist(group4)

## Boxplot
boxplot(group4)


#### Apply a suitable normality test, using both visual and statistical diagnostics,
### for group1, group2, group3, group4

qqnorm(group1)
qqnorm(group2)
qqnorm(group3)
qqnorm(group4)
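A Q-Q plot is easier to read with a reference line; base R's qqline() draws the line through the first and third quartiles (a small addition to the workshop script):

qqnorm(group4)
qqline(group4, col = "red")   # points close to the line indicate approximate normality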



shapiro.test(group1)
shapiro.test(group2)
shapiro.test(group3)
shapiro.test(group4)


## Do the scatterplot of group1 vs group4
plot(group1, group4)

## Find the correlation between group1 and group4, and between group1 and group3.
## Do the correlation test.
## Scale each of the group1 and group4 observations by 0.5 and
## find the correlation coefficient of the scaled data.
## Check whether the correlation coefficients of the original and scaled variables
## are the same or not.

cor(group1, group4)
cor(group1, group3)

cor.test(group1, group4)
cor.test(group1, group3)

plot(group1, group4)

group11 = group1/0.5   # scaled versions of the original variables
group41 = group4/0.5

plot(group11, group41)
plot(group1, group4)

cor.test(group11, group41)   # correlation is invariant to scaling


### Fit a linear regression by considering group4 as the response and
### group1 as the explanatory variable

fit.lm = lm(group4 ~ group1)
summary(fit.lm)

new.group1 = group1^2   # polynomial terms in group1
new.group2 = group1^3
fit.new = lm(group4 ~ group1 + new.group1 + new.group2)
summary(fit.new)

fitted.values(fit.lm)
rs1 = group4 - fitted.values(fit.lm)   # residuals computed by hand
plot(rs1)

## Interpret the p-values for the estimated regression parameters
## Also interpret the significance of R^2.

summary(fit.lm)
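The quantities to interpret can also be extracted from the summary object directly (a sketch; these component names are standard for summary.lm objects):

s = summary(fit.lm)
coef(s)        # estimates, standard errors, t-values and p-values
s$r.squared    # proportion of the variance in group4 explained by the model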


### Plot the residuals.
## Does the residual distribution follow a normal law? Do the necessary tests
rs = residuals(fit.lm)
plot(rs)
abline(h = 0, col = "red")
hist(rs)
qqnorm(rs)
shapiro.test(rs)


### Apply a suitable test for testing the equality of the population means
### of group1 and group4.
t.test(group1, group4)

## Testing of variances ##
#library(lmtest)
#bptest(group1~group4)

bartlett.test(list(group1, group4))
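For exactly two groups, the classical F test for equality of variances is also available in base R (an addition for completeness; note that bptest() from lmtest, commented out above, tests for heteroscedasticity in a regression rather than equality of two group variances):

var.test(group1, group4)   # F test comparing the two group variances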


## Non-parametric alternative (requested by Mr. Vishnu, one of the enthusiastic participants):
## the Wilcoxon rank-sum test for the non-parametric case.

s1 = rexp(n = 1000, rate = 1/65)   # exponential sample with mean 65
s1
mean(s1)

s2 = rexp(n = 1000, rate = 1/75)   # exponential sample with mean 75
s2
mean(s2)

shapiro.test(s1)   # normality is clearly rejected, so a non-parametric test is appropriate
shapiro.test(s2)

wilcox.test(s1, s2, paired = F)   # rank-sum test for two independent samples