Simulation
Monte Carlo Integration
Suppose that we want to perform the integration of the function f(x) = x^2 exp(-2x) in the interval (0,6)
my_fun = function(x){ # declaration of the function
(x^2)*exp(-2*x) # body of the function
}
# It is always a good idea to plot the function
par(mfrow =c(1,1)) # figure plotting windows
curve(my_fun(x), 0, 6, col = "red", lwd=2)
# Other way to plot it (Mainly for Python user)
x = seq(0, 6, length.out = 100)
print(x)
val_fun = my_fun(x) # evaluate the function at x values (mesh)
plot(x, val_fun, col = "red", type = "l", lwd=2)
# Now let us consider doing the integration
# Use of inbuilt function.
val_integral = integrate(my_fun, lower = 0, upper = 6)
# Note: The function integrate is used to perform
# integration in R
print(val_integral) # the value can be considered exact
# let us do the same integral by using simulation
# We shall write down the integral as an expectation of
# some random variable
# So, we can consider X as a uniformly distributed
# random variable over the interval (0,6). Its density function is
# f(x) = 1/6, 0<x<6
# So we are able write down the integral as an expectation of th
# random variable X^2e^(-2X), where X follows U(0,1)
#
# So the problems boils down to approximating the expectation
# of a random variable g(X) = X^2e^(-2X), where X follows
# U(0,1)
# Weak Law of Large Numbers
# If X_1, X_2, ... X_n is a sample of size n from some population
# with mean value mu. Then sample mean (X_1+X_2+...+X_n)/n
# converges to population mean mu in probability.
# Note 1: The sample mean is a random variable.
# But as n tends to infinity, this random quantity acts like a
# constant. The constant is nothing but the population mean
# (mu).
# In our problem, we have random variable g(X) and we
# want to evaluate E(g(X)). So if, we simulate a large number
# of samples from the distribution of g(X) and average them
# then the average value should be close to the E(g(X)), by
# Weak Law of Large Numbers.
# However, the distribution of g(X) may not be known. So we
# will simulate random numbers from the distribution of X
# (uniform(0,6)). And will evaluate the function g(.) at
# those values
n = 1000 # number of observations to be simulated from U(0,6)
x_sim_val = runif(n, min = 0, max = 6)
# It is always a good idea to check the codes immediately
# after you write it.
print(x_sim_val)
hist(x_sim_val, probability = TRUE, col = "grey")
g = my_fun
g_sim_val = g(x_sim_val)
g_sim_val
hist(g_sim_val, probability = TRUE, col = "grey")
approx_integral = 6*sum(g_sim_val)/n
approx_integral
# What we have done: We have approximate the integral
# by using the Monte Carlo integration method. It is very
# important to mention that based on how many number of
# simuated observations, you have computed the value of the
# integral. In this case, we have n=1000
# As a programmer, you should always explore and try things
# Had we consider n= 50, would we get the same accuracy?
# It is important to see the accuracy of the approximation
# as a function of the sample size n
# We want to evaluate the integral as function of n as well
# We want to do the same exercise for different values of n
# We can use an array for values of n and run a loop
n_vals = 1:1000 # an array containing 1,2,3,...,1000
print(n_vals)
approx_integral = numeric(length = length(n_vals))
for (n in n_vals) {
x_sim_val = runif(n, min = 0, max = 6)
g_sim_val = g(x_sim_val)
approx_integral[n] = 6*sum(g_sim_val)/n
}
print(approx_integral) # important
# We will write the integral as a function of n
plot(n_vals, approx_integral, type = "l", col = "red", lwd=2)
abline(h = val_integral$value, col = "grey", lwd=2)
# How to write a legend nicely
legend("topright", legend = c("monte carlo integral", "using integrate()"),
col = c("red", "grey"), lwd=c(2,2))
# We can see that as n increases the accuracy of the monte
# carlo integration increases.
# What we have learnt is that any definite integral
# can be evaluated using Monte Carlo integration. The key
# step is that we need to write the integral as an expectation
# and use the property that the sample mean converges to the
# population mean (in probability ) as n increases.
# Exercise:
# (a) Integrate x^2e^(-2x) in the interval (0, infty)
# (b) Let us suppose that I_n is the value of the integral
# based on a sample of size n. Thus, as the sample values
# changes, the value of I_n also changes. Essentially,
# the quantity I_n is a random variable, not a fixed number.
# Show that the expectation of I_n is I (the exact Integral)
# and Compute the variance of I_n using simulation. Visualize
# the shape of the SAMPLING DISTRIBUTION of I_n using
# 1000 replicatoin. Do this for n=10, n= 30, n= 50, n=100
# in the same graph.
Statistical Simulation: Lecture - I (Date: 12th January 2018)
There are n persons and their hats are numbered as 1, 2, 3,...,n. Suppose they throw the hat in the center of the room and asked to randomly pick one hat. We are interested to compute the probability that at least one person will get his own hat. This problem is discussed last semester in the class and appeared in examination too. The probability of at least one correct assignment of a hat is 1 - (1-1/n)^n. The limit as n tends to infinity is 1-1/e. Let us verify this claim using Statistical Simulation. We are planning to do: we randomly select n numbers from the set {1, 2, ..., n} by simple random sampling without replacement, so that all numbers will be selected in every experiment. This experiment can be performed using R by using the sample(1:n, size = n, replace = FALSE). The replace = FALSE indicates that once a number is selected it will not appear again. Use set.seed(123) function so that the same answer can be obtained for all.
n = 10 # Number of hats
x = 1:n # Reference number to checked for each experiment
m = 1000 # Number of experiments
out = matrix(data = NA, ncol = n, nrow = m) # Experimental outcomes to be stored in out
for(i in 1:m){
out[i, ] = sample(x, size = length(x), replace = FALSE)
}
s = numeric(m) # Storing the number of success
for(i in 1:m){
if(sum(x==out[i,])>0)
s[i] = 1
}
prob = sum(s)/m
print(prob) # Probability of at least one success
Exercise: Write an R program to perform the above experiment with varying n, the number of hats. Plot the probability of at least one matching (obtained from simulation) with respect to n.
Solution:
n = 1:50 # number of persons
m = 10000 # number of times experiment to be performed to estimate the probability of at least on matching
prob_exact = 1-(1-1/n)^n
prob_sim = numeric(length = length(n))
for(i in 1:length(n)){
x = 1:n[i]
out = matrix(data = NA, ncol = n[i], nrow = m)
for(j in 1:m){
out[j, ] = sample(x, size = length(x), replace = FALSE)
}
s = numeric(m) # Storing the number of success
for(j in 1:m){
if(sum(x==out[j,])>0)
s[j] = 1
}
prob_sim[i] = sum(s)/m
}
plot(n, prob_sim, type = "l", col = "red", ylim = c(0.4,1), lwd=2)
lines(prob_exact, type= "l", col = "blue", lwd=2)
legend("topright", legend = c("empirical prob.", "exact prob."), lwd=c(2,2), col=c("red","blue"), bty="n")
While solving the above problem, I felt that a short introduction to R is required, hence the following codes have been developed during lecture and the basic things were explained. These will be rigorously covered in your Software Lab-II course. This page is only for statistical simulations. Find your valuable time to run these codes and understand their functionalities for a better understanding of R Programming. We shall explicitly write programs, so it is better you get familiar with vector, matrix, data.frame, list in R.
In the following we evaluate the probability that X lies between 0 to 1. The distribution of X is normal with mean 1 and sigma 1. The required area under the curve is done by means of simulation.
mu = 1
sigma = 1
fun = function(x){
(1/(sqrt(2*pi)*sigma))*exp(-(x-mu)^2/(2*sigma^2))
}
prob = integrate(fun, 0, 1)$value
curve(fun(x), -3, 5, col = "red", lwd=3)
segments(0,0,0,fun(0), lwd=2)
abline(h=0, lwd=2)
segments(1,0,1,fun(1), lwd=2)
rep = 100
count = numeric(rep)
for(i in 1:rep){
x = rnorm(i, mean = mu, sd = sqrt(sigma))
count[i] = sum(x>0 & x<1)/i
}
plot(1:rep, count, col = "red", pch = ".")
abline(h = prob, col = "green", lwd=2)
Lecture - III (Date: 17th January 2018)
We start our discussion with some plot options. Few students had doubts about some plot options. First, we clarify those, in particular, type = option and how to put legends in a plot. The code which is developed during the lecture follows: Concepts are self-explanatory. Use help(plot), help(legend), help(lines), help(points), help(seq) to know more about those functions.
x = seq(1, 10 ,length.out = 100)
print(x)
y = sin(x)
z = cos(x)
plot(x, y, col = "red" )
points(x, z, col = "blue")
plot(x, y , col = "red", type="l")
lines(x, z, col = "blue", type = "l")
plot(x, y , col = "red", type="b", lwd =2)
lines(x, z, col = "blue", type = "b" ,lwd=2)
# Adding legend
plot(x, y , col = "red", type="l", lwd =2, xlim = c(1, 10),
ylim=c(-1,1.5))
lines(x, z, col = "blue", type = "l", lwd=2)
legend("topright", legend = c("sin(x)","cos(x)"),
col = c("red", "blue"), lwd =c(2,2), bty = "n")
Then we have recalled what we have done in the previous lecture. In the previous lecture, we have written a function generate_exp() to generate n random numbers from the exponential distribution with a fixed rate parameter lambda. We do the process for the logistic(mu, beta) distribution. Note that for the logistic distribution mu is the location parameter and beta is the scale parameter. It is important to introduce the concept of location and scale family.
Location Family: Let f(x) be any probability density function. Then the family of probability density functions f(x-mu), indexed by the parameter mu, -inf < mu < inf, is called location family with standard pdf f(x) and mu is called the location parameter for the family.
Scale Family: Let f(x) be any probability density function. The for any sigma>0, the family of probability density function (1/sigma)f(x/sigma), indexed by the parameter sigma, is called the scale family with standard pdf f(x) and sigma is called the scale parameter of the family.
Location-Scale Family: Let f(x) be any probability density function. The for any mu, -inf < mu < inf, and any sigma>0, the family of probability density functions (1/sigma)f((x-mu)/sigma), indexed by the parameter (mu, sigma), is called the location-scale family with standard pdf f(x); mu is called the location parameter and sigma is called the scale parameter.
Recall that in the first semester we have studied the logistic distribution with parameter mu and beta. The logistic density function is given by f(x) = (1/beta)*(exp(-(x-mu)/beta))/(1 + exp(-(x-mu)/beta))^2, -inf < mu < inf, beta > 0. Note that the logistic family is a location-scale family with the location parameter mu and scale parameter beta. Now we shall write our own function to generate random numbers from the logistic distribution with parameter mu and sigma. Since the distribution function can be explicitly computed from the logistic density function, it is easy to verify the probability integral transform method. Explicit computation (done during the lecture) provides the logistic distribution function as F(x) = 1/(1 + exp(-(x-mu)/beta)). So if u is a uniform random number from [0,1] then the inverse of F at u will give a logistic random variable. Inv(F)(u) = mu + beta * log(u/(1-u)). The following function generate_log(n, mu, beta) utilizes this fact to generate random numbers from logistic(mu, beta) distribution. We also compare our results with the generator rlogis() in R by means of histograms. The corresponding probability density function is overlayed on the histogram for verification.
generate_log = function(n=10, mu=1, beta=2){
n = n # number of random values
mu = mu # location parameter
beta = beta # scale parameter
if(beta <= 0){
return("Invalid scale parameter")
}
u = runif(n, min = 0, max = 1)
x = mu + beta * log(u/(1-u))
return(x)
}
par(mfrow = c(1,2))
mu = 5
beta =3
x = generate_log(n=10000, mu=5, beta = 3)
hist(x, probability = TRUE, ylim = c(0,0.08),
main = "Generate using Inv tranform")
logistic_pdf = function(x){
(1/beta)*(exp(-(x-mu)/beta))/(1+exp(-(x-mu)/beta))^2
}
curve(logistic_pdf(x), -20, 30, lwd=2,
col = "red", add = TRUE)
y = rlogis(n = 10000, location = 5, scale = 3)
hist(y, probability = TRUE, main = "Generate using R")
# Checking for probability density function
exp_pdf = function(x){
exp(-x)
}
integrate(exp_pdf, 0, Inf)
f= function(x){
x^2
}
integrate(f, -1, 1)
Convergence in Probability: This is a very important concept in statistics and plays a very crucial role in statistical simulation. In the lectures, we have already checked how to show the convergence in probability (recall the use of Chebychev's inequality). We shall develop R codes to visualize this concept. There is one magical theorem, we studied in the class, is called the Weak Law of Large Numbers (WLLN). WLLN states that, suppose we generate a random sample of size n from a population distribution which has a finite mean (mu, say) and finite variance, then as n increases the sample mean converges to the population mean (mu) in probability. In layman language, it means that for large n, the sample mean values will be highly concentrated about mu. In the following code, we visualize this fact by simulating random numbers from the uniform distribution in the interval [a,b]. The mean (mu) = (a+b)/2 and var = (b-a)^2/12. So the population distribution has finite mean and finite variance, hence WLLN can be applied. For each n we generate n uniform random numbers from [a,b] and compute the sample mean bar(x_n). Plot n versus bar(x_n) for a range of n values.
par(mfrow= c(1,1))
a = 1; b = 10
n = 10000 # sample size
x_mean = numeric(n)
for(i in 1:n){
x = runif(i, min = a, max = b)
x_mean[i] = mean(x)
}
plot(1:n, x_mean, type= "p", col = "red")
abline(h = (a+b)/2, col = "blue") # mean values of are concentrated about this line
We can visualize the same thing in terms of distribution as well. Basically, the random variables bar(X_n) will be asymptotically degenerate and mu. So here, we plot the distribution of the sample mean for different sample sizes. Please understand the logic of each loop. First, visualize them and draw them on a piece of paper and then start writing the code. Understand the graphical output clearly and match with the lecture notes (theory class).
n = c(5, 20, 100, 500)
m = 1000 # number of experiments to generate the distribution of bar(X_n)
a = 1; b =5
par(mfrow=c(2,2))
for(i in n){
x_mean = numeric(m)
for(j in 1:m){
x = runif(i, min = a, max = b)
x_mean[j] = mean(x)
}
hist(x_mean, probability = TRUE,
xlim=c(a,b), main = paste("n = ", i))
abline(v = (a+b)/2, col = "red", lwd=2)
}
Exercise: In the above exercise different histograms are plotted in different plotting regions for different values of n. Can we combine all the histograms in a single plot having different colors? Add a legend to this plot for different values of n with matching colors.
Statistical Simulation (Lecture-IV Date: 22nd January 2018)
The agenda of our lecture is to visualize the concept of convergence in distribution. Before going to that let us refresh ourselves with a problem on convergence in probability. In the class, a homework was given to you, where you are asked to show that if X_1, X_2,..., X_n are a random sample from uniform(0, theta), then X_(n) = max(X_1,...X_n) converges to theta in probability as n tends to infinity. Here we shall visualize this fact. To make it more general, we generated the uniform random numbers from the interval [a,b] and computed the quantities x_(n) and x_(1) and plotted them for increasing number of sample size. x_(1) will be highly concentrated about a and x_(n) will be highly concentrated to b for a large number of sample sizes (n large).
par(mfrow=c(1,1))
a= 3; b = 6
n = 1:1000
x_1 = numeric(length = length(n))
x_n = numeric(length = length(n))
for(i in 1:length(n)){
x = runif(i, min = a, max = b)
x_1[i] = min(x)
x_n[i] = max(x)
}
plot(n, x_1, col = "red", ylim = c(a-1, b+2), lwd=2,
ylab = "max(x) and min(x)", xlab = "sample size (n)")
points(n, x_n, col = "green", lwd = 2)
legend("topright", legend = c("max(x)", "min(x)"),
lwd = c(2,2), col = c("green", "red"), bty= "n")
Convergence in Distribution: In continuation of the previous problem, we have also seen that the quantity n(1-X_(n)) converges in distribution to the exponential random variable with mean=1. X_(n) = max(X_1,..X_n) and the random numbers are generated from the uniform[0,1] distribution. Let us check whether the distributions are converging to exponential or not. The algorithm works in the following way: Let us fix n = 4, then generate m = 1000 many samples of size n. For each m, we have 5 uniform random numbers from [0,1]. Compute the maximum of this five numbers (x_(n), say) and evaluate n(1-x_(n)). This will be a realization of the random variable n(1-X_(n)) [Note the difference between the use of small x and capital X]. Now draw the histogram of these values. Do this process for n =10, 15, 20 and see that the histograms will be closer to the exponential density function (with lambda = 1).
n = c(5, 10, 20 ,100)
m = 1000
par(mfrow=c(2,2))
for(i in n){
x = matrix(NA, nrow = m, ncol = i)
y = numeric(m)
for(j in 1:m){
x[j,] = runif(i, min =0, max=1)
y[j] = i*(1 -max(x[j,]))
}
hist(y, probability = TRUE, col = "grey", main = paste("n=",i))
curve(exp(-x), add= TRUE, col = "red")
}
Central Limit Theorem: One of the remarkable theorem in statistics. It says that if we have independent and identically distributed random variables following some distribution having a finite mean (mu) and finite variance (sigma^2), the distribution of the random variable Y_n = [sqrt(n)(bar(X_n)-mu)/sigma] can be approximated by the standard normal distribution function. That is Y_n converges to N(0,1) in distribution as n becomes large. Recall that while proving the theorem we have assumed that the moment generating function exists for the population distribution. This may not be the case in general. In the following code, we have demonstrated the central limit theorem using random samples from the uniform[a,b] distribution. Experiment using some other distribution functions like exponential, binomial, Poisson etc.
a = 2
b = 5
mu = (a+b)/2
sigma = (b-a)/sqrt(12)
m = 1000
par(mfrow=c(2,2))
for(n in 1:4){
y = numeric(m)
for(i in 1:m){
x = runif(n, min = a, max = b)
y[i] = sqrt(n)*(mean(x)-mu)/sigma
}
hist(y, probability = TRUE, main = paste("n=",n), xlim = c(-3,3))
curve(expr = (1/sqrt(2*pi))*exp(-x^2/2), col = "red", lwd =2, add = TRUE)
}
Statistical Simulation (Lecture-V Date: 30th January 2018)
Let us recall the utility of statistical simulation. The first one is that we are able to generate a large number of artificial observations from a theoretical probability distribution. Simulation can be used to evaluate the performance of statistical methods which is otherwise not possible using real data as the model parameters are not known. It also plays a very important role to estimate the probability distribution of a random variable whose theoretical distribution is difficult to obtain. We have also verified some probability questions using simulation. Today, we shall mainly concentrate on the simulation of random numbers following some discrete distribution.
Suppose we are interested to generate a random number from the binomial(n,p) distribution. Suppose U_1, U_2,...U_n are independent uniform(0,1) random numbers. For each i = 1 to n, if (U_i <=p), set X_i = 1, otherwise X_i = 0. Then X = X_1 + X_2 +...X_n is an observation from the binomial(n,p) distribution.
Suppose we are interested to generate a random number from the geometric(p) distribution. Go on generating U_1, U_2,... from uniform(0,1) distribution and assign X = min{i, U_i <=p}. Then X follows geometric(p) distribution. Since the expected value of X is 1/p, this implies X is likely to be large if p is very small. An alternative method to generate a geometric(p) random variable is that if U follows uniform(0,1) distribution, then X = [ln(U)/ln(1-p)]+1 is a random variable with geometric(p) distribution.
Suppose we are interested to generate a random number from the Poisson(lambda) distribution. Go on generating U_0, U_1, U_2,... from uniform(0,1). If U_0 <= exp(-lambda), X=0; else X = min{i: U_0*U_1*...*U_(i-1) > exp(-lambda) and U_0*U_1*...*U_(i-1)*U_i <=exp(-lambda)}. Then X follows a Poisson(lambda) distribution.
Statistical Simulation: (Lecture - VI Date: 10th February 2018 )
x = c(1, 7, 2, 3, 4,9, 11) # sample vector
B = 1000 # Number of bootstrap samples to be generated
out = matrix(NA, nrow = B, ncol= length(x))
b_mean = numeric(B)
b_sd = numeric(B)
for(i in 1:B) {
out[i,] = sample(x, size = length(x), replace = TRUE)
b_mean[i] = mean(out[i,])
b_sd[i] = sd(out[i,])
}
mean(b_mean) # Bootstrap mean of sample mean
sd(b_mean) # Bootsrap standard error
b.ci.mean = c(quantile(b_mean, c(2.5, 97.5)/100))
mean(b_sd) # Bootstrap estimate of population standard deviation
sd(b_sd) # Bootstrap standard error of population sd
b.ci.sd = c(quantile(b_sd, c(2.5, 97.5)/100))
par(mfrow = c(1,2))
hist(b_mean, probability = TRUE, col = 2, main="Sample Mean (bootstrap)")
hist(b_sd, probability = TRUE, col = 2, main = "sample SD (bootstrap)")
# Bootstrapping a Linear Regression model
p = 2 # The number of predictors
b0 = 1 # Assumed value of the intercept
b1 = 2 # Assumed coefficient of X1
b2 = 3 # Assumed coefficient of X2
sd = 0.2 # Assumed population standard deviation of eps
n = 30 # Number of data points
set.seed(1)
X1 = runif(n, 1, 10)
X2 = runif(n, 1, 10)
Y = numeric(n)
Y = b0 + b1*X1 + b2*X2 + rnorm(n, 0, sd = sd) # Generate the response variable
print(Y)
X = cbind(rep(1, n), X1 , X2) # Create the design matrix
print(head(X, n =7)) # Print first seven rows of the data matrix
bhat = solve(t(X)%*%X)%*%t(X)%*%Y # Parameter estimates
print(bhat)
# Bootstrapping the regression model
m = 1000
b_hat = matrix(data =NA, nrow = m, ncol = 3)
for (i in 1:m) {
ind = sample(1:n, size = n, replace = TRUE)
bX = X[ind, ]
bY = Y[ind]
b_hat[i, ] = solve(t(bX)%*%bX)%*%t(bX)%*%bY
}
head(b_hat)
# Bootstrap distribution of the estimators of the regression model parameters
par(mfrow = c(2,2))
hist(b_hat[,1], probability = TRUE, xlab = "b0", main = "")
hist(b_hat[,2], probability = TRUE, xlab = "b1", main = "")
hist(b_hat[,3], probability = TRUE, xlab = "b2", main = "")
# Demonstration of Delta Method
# Determination of the confidence interval for the proportion. # Consider the problem of estimating the population proportion of a bin(1,p) population. We draw a random sample of size n and store the sample mean. Continue this process a large number of times so that the sampling distribution of the sample mean can be visualized by means of a histogram. It is interesting to see that the histogram looks like a normal distribution. However, this is not a surprising output (thanks to the Central Limit Theorem).
n_sample = c(5, 10, 15, 20) # number of sample to be drawn
rep = 1000 # number of replication
p = 0.3 # true proportion in the population
par(mfrow=c(2,2))
for(n in n_sample){
mat = matrix(NA, nrow = rep, ncol = n) # matrix to hold the observations
for( i in 1:rep){
mat[i, ] = rbinom(n, 1, p)
}
Let us check the histrogram of each column. Note that each column contains sample from a biomial(1,p) population. Let us compute the sampling distribution of the sample proportion
p_hat = numeric(rep)
for(i in 1:rep){
p_hat = rowMeans(mat)
}
# p_hat is an unbiased estimator. So average of p_hat values should be close to p (true)
mean(p_hat) # which is close to the population proportion
# theoretical sd of p_hat is sqrt(p(1-p)/n)
sd_p_hat = p*(1-p)/n; sd_p_hat
# estimation of the sd of p_hat
est_sd_p_hat = sqrt(p_hat*(1-p_hat)/n)
mean(est_sd_p_hat) # usually we have only a single sample of size n
# approximation by normal distribution
hist(p_hat, col = "grey", prob = T, main = paste("n=", n), xlab = expression(hat(p)), ylim =c(0,4))
curve(dnorm(x, mean = mean(p_hat), sd = mean(est_sd_p_hat)), lwd=2, col = "red", add = TRUE)
}
Suppose that we observe X_1, X_2,...X_n independent Bernoulli(p) random variables. In the previous code, we were interested to estimate the success probability p, but another popular parameter is p/(1-p), the odds. For example, if the data represent the outcomes of a medical treatment with p=2/3, then the person has odds 2:1 of getting better. Moreover, if there were another treatment with success probability r, then the odds ratio [p/(1-p)]/[r/(1-r)] is of interest that gives the relative odds of one treatment over another. Typically estimate of the success probability p is p_hat = (X_1 + X_2+...+X_n)/n. We might consider using p_hat/(1-p_hat) as an estimate of p/(1-p). Then what are the properties of this estimator? How might we estimate the variance of p_hat/(1-p_hat)? How can we approximate the sampling distribution of p_hat/(1-p_hat)? The following code demonstrates the use of delta method in approximating the sampling distribution of the estimator.
par(mfrow=c(1,1))
set.seed(123)
p = 0.3 # population proportion (always unknown)
n_sample = c(10, 20, 30, 50) # sample size
for(n in n_sample){
x = rbinom(n, 1, p) # generate a sample of size n
p_hat = mean(x)
m = 500 # number of bootstrap replication
mat = matrix(NA, nrow = m, ncol = n)
for(i in 1:m){
mat[i, ] = sample(x, size = n, replace = TRUE)
}
boot_mean = mean(rowMeans(mat))
hist(rowMeans(mat), prob = T, col = "grey", main = paste("n=", n), xlab = expression(g(hat(p)))) # bootstrap distribution of p_hat
boot_sd = sd(rowMeans(mat))
boot_ci = quantile(rowMeans(mat), c(0.025, 0.975)) # nonparameteric confidence interval
cat("The bootstrap confidence interval nonpatametric type \n")
print(boot_ci)
cat("The bootstrap confidence interval normal type\n")
print(boot_ci_normal)
cat("Theconfidence interval based on asymptotic normality\n")
print(ci_large_sample)
boot_ci_normal = c(boot_mean - 1.96*boot_sd, boot_mean + 1.96*boot_sd)
ci_large_sample = c(p_hat - 1.96*sqrt(p_hat*(1-p_hat)/n),p_hat + 1.96*sqrt(p_hat*(1-p_hat)/n))
abline(v = boot_ci, col = "red", lwd = 2)
abline(v = boot_ci_normal, col = "blue", lwd = 2)
abline(v = ci_large_sample, col = "green", lwd =2)
legend("topright", legend = c("nonparamertic CI", "normal CI", "large sample CI"), col = c("red","blue","green"), lwd=c(2,2,2), bty = "n")
curve(dnorm(x, mean = p_hat, sd = sqrt(p_hat*(1-p_hat)/n)), col ="green", lwd=2, add = TRUE)
}
Demonstration of Multivariate Delta Method: Ratio type estimator Xbar/Ybar for estimating mu_x/mu_y. This following program is mainly written by Rahul Pal and Dipali Mestry (M.Sc. students)
library(MASS)
sigma1=2
sigma2=3
rho=0.4
sigma=matrix(c((sigma1)^2,rho*sigma1*sigma2,rho*sigma1*sigma2,(sigma2)^2),2,2,byrow=T)
sigma
mu1=20
mu2= 10
param = mu1/mu2
rep = 10000
n = 1000
est_ratio = numeric(rep)
for( i in 1:rep){
mat = mvrnorm(n = n, mu = c(mu1,mu2), Sigma = sigma)
est_ratio[i] = mean(mat[,1])/mean(mat[,2])
}
hist(est_ratio, prob = T, col = "grey")
points(mu1/mu2, 0, lwd=2, pch ="*", col = "blue", cex = 3)
delta_var_true = (mu1/mu2)^2*((sigma1/mu1)^2 +(sigma2/mu2)^2 - 2*rho*sigma1*sigma2/(mu1*mu2))/n
print(delta_var_true)
var(est_ratio)
curve(dnorm(x, mean = mean(est_ratio), sd = sqrt(delta_var_true)), lwd=2, col = "red", add =TRUE)
mat = mvrnorm(n = n, mu = c(mu1,mu2), Sigma = sigma)
delta_var_est = (mean(mat[,1])/mean(mat[,2]))^2*((sd(mat[,1])/mean(mat[,1]))^2 +(sd(mat[,2])/mean(mat[,2]))^2 - 2*cor(mat[,1], mat[,2])*sd(mat[,1])*sd(mat[,2])/(mean(mat[,1])*mean(mat[,2])))/n
curve(dnorm(x, mean = mean(mat[,1])/mean(mat[,2]), sd = sqrt(delta_var_est)), lwd=2, col = "blue", add =TRUE)
Bivariate normal distribution. Typically mvrnorm() function from the MASS package is used to generate random sample from population whose distribution is characterized by a bivariate (or multivariate normal) distribution with mean vector and the covariance matrix (a symmetric positive definite matrix). However, in the following we use a well fact about the multivariate normal distribution to generate the data values.
mu_1 = 3; mu_2 = 4; sigma_1 = 1; sigma_2 = 2; rho = 0.4
n = 1000 #number of bivariate samples
x = numeric(n)
y = numeric(n)
# theorem says that given X=x, Y is normally distributed with mean rho*(sigma_2/sigma_1)*(x-mu_1) and sd sigma_2*sqrt(1-rho^2). This theorem can be used to generate data from a bivariate normal distribution
for(i in 1:n){
x[i] = rnorm(1, mu_1, sigma_1)
y[i] = rnorm(1, mu_2 + rho*(sigma_2/sigma_1)*(x[i]-mu_1),
sd = sigma_2*sqrt(1-rho^2))
}
hist(x, prob = T, col = "grey")
curve(dnorm(x, mu_1, sd = sigma_1), col = "red", lwd=2, add = TRUE)
hist(y, prob = T, col = "grey")
curve(dnorm(x, mu_2, sd = sigma_2), col = "red", lwd=2, add= TRUE)
library(hexbin)
plot(hexbin(data.frame(x,y)))
library(car)
dataEllipse(x,y, levels = c(0.975), col = "red")
# Relationship among the distributions!
# For large lambda, the Poisson distribution can be approximated by a normal distribution
lambda = c(1, 5, 10, 20)
n = 1000
par(mfrow=c(2,2))
for(lam in lambda){
x = rpois(n, lambda = lam)
hist(x, probability = TRUE)
}
# Shape of the distribution for range of p values for binomial distribution. Please note the changes in modes in the histogram
par(mfrow=c(2,2))
p = c(0.1, 0.3, 0.6, 0.9)
m = 30
n= 100
for(prob in p ){
x = rbinom(n, size =m, prob = prob)
hist(x, prob = TRUE, col = "grey", xlim=c(0,m))
}
# For small p and large m, the binomial(m,p) distribution can be approxiamted by Poisson(lambda = np) distribution
par(mfrow=c(2,2))
p = 0.01
m = c(20, 50, 100, 300)
n = 10000
for(size in m){
x = rbinom(n, size = size, prob = p)
hist(x, prob =T, xlim=c(0, 10), border="red" )
x = rpois(n, lambda = size*p)
hist(x, prob =T, xlim=c(0, 10), add= TRUE, border = "blue" )
}
# sum of n bernoulli(1,p) random variables follows bin(n,p) distribution. The code is mainly written by Keshava Rao (M. Sc. students) and I just did some modification to show the graph in a better way
n = 10 # number of trials in the coin tossing experiment
p = 0.5
rep=10000 # number of repititions of n trials
y=numeric(rep)
sum=0
par(mfrow=c(1,2))
x=matrix(NA,rep,n)
for (i in 1:rep) {
for (j in 1:n) {
x[i,j]=rbinom(1,1,p)
sum=sum+x[i,j]
}
y[i]=sum
sum=0
}
x
y
proportion = numeric(n+1)
for(i in 0:n){
proportion[i+1] = sum(y==i)/rep
}
par(mfrow=c(1,1))
plot(0:n, proportion, type= "p", col = "blue", pch="o")
points(0:n, dbinom(0:n, n, p),pch = '*', lwd=2, cex =2, col = "red")
legend("topright", c("Empirical", "Theoretical"), col=c("blue","red"), pch = c("o","*"),lwd=c(1,1) )