The following is an R script that exercises and checks the confidence boxes (c-boxes) for the normal distribution. It demonstrates that the c-boxes for the mean and standard deviation perform statistically, in the sense that they produce a uniform map in a Singh plot. (A Singh plot records, for each of many simulated datasets, the probability mass of the c-box that falls below the true parameter value; when the c-box is valid, these values are uniformly distributed, so their empirical distribution function traces the diagonal.) It also demonstrates that one can compute with confidence boxes by forming sums, differences, and, with certain limitations, products, quotients, minima, maxima, and powers. In particular, note that subtracting c-boxes for normal means solves the Behrens-Fisher problem. These results also carry the confidence interpretation, as the Singh plots show. This script was created as part of our quality assurance efforts for c-box software. You can run the script and see its output by sequentially pasting its sections (delimited by rows of pound signs #) into the R console.
<<THIS CODE IS CURRENTLY IN REVISION>>
############################################################################################
# NORMAL MEAN
# Gosset was correct: demonstrate that the scaled t-distribution is a confidence distribution,
# and thus a c-box, for the mean of a normal distribution
# c-box and confidence distribution for the normal mean
cd = function(z) {n = length(z); mean(z) + sd(z) * rt(many,n-1) / sqrt(n)} # Monte Carlo sample from the c-box for the mean
edf = function(x) lines(sort(x), 1:length(x)/length(x), type='s') # draw an empirical distribution function as a step function
some = 5000
many = 5000
mu = runif(1,-100,100)
sg = runif(1,0,100)
cat('normal(',mu,',',sg,')\n',sep='')
nn = c(2,3,4,5,6,10,100,1000,10000)
par(mfrow=c(3,3))
for (n in nn) {
  u = array(0, some)
  for (i in 1:some) {
    x = rnorm(n, mu, sg)
    u[i] = sum(cd(x) < mu) / many
  }
  plot(c(0,1), c(0,1), type='l', col=2)
  edf(u)
  title(paste('n', n))
}
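# The uniform Singh plots above certify the confidence interpretation. As a minimal
# complementary check (a sketch added here, not part of the original script), the
# central quantiles of a c-box sample form a confidence interval whose empirical
# coverage can be verified directly; the sample size n and the 95% level below are
# arbitrary illustrative choices.
n = 10
covered = 0
for (i in 1:some) {
  x = rnorm(n, mu, sg)
  ci = quantile(cd(x), c(0.025, 0.975)) # 95% confidence interval for the mean
  if ((ci[[1]] <= mu) & (mu <= ci[[2]])) covered = covered + 1
}
cat('Empirical coverage of the 95% interval:', covered/some, '\n')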
############################################################################################
# GOSSET FOR GIANT VARIANCES
n = 10
mu = 2
sgsg = c(0.0001,0.01,0.5,1,2,5,10,20,100)
many = 5000
some = 5000
cd = function(z) {n = length(z); mean(z) + sd(z) * rt(many,n-1) / sqrt(n)}
edf = function(x) lines(sort(x), 1:length(x)/length(x),type = 's')
par(mfrow=c(3,3))
for (sg in sgsg) {
  u = array(0, some)
  for (i in 1:some) {
    x = rnorm(n, mu, sg)
    u[i] = sum(cd(x) < mu) / many
  }
  plot(c(0,1), c(0,1), type='l', col=2)
  edf(u)
  title(paste('sigma', sg))
}
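# As the preamble claims, the Singh plots remain uniform across this whole range of
# sigmas: the validity of Gosset's c-box for the mean does not depend on the variance.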
############################################################################################
# ADDITION OF NORMAL MEANS
f = function(a,b) a+b
cd = function(x) {n = length(x); mean(x) + sd(x) * rt(many,n-1) / sqrt(n)}
edf = function(x) lines(sort(x), 1:length(x)/length(x),type = 's')
n1 = 8
n2 = 8
sigma1 = 1
sigma2 = 1
some = 2000
many = 2000
mus = c(0.0, 0.1, 0.2, 0.3, 0.4)
m = length(mus)
par(mfrow=c(5,5))
for (j in 1:m) for (k in 1:m) {
  mu1 = mus[[j]]
  mu2 = mus[[k]]
  u = array(0, some)
  for (i in 1:some) {
    x = rnorm(n1, mu1, sigma1) # sample data
    y = rnorm(n2, mu2, sigma2) # sample data
    d = f(cd(x), cd(y))
    u[i] = sum(d < f(mu1,mu2)) / many
  }
  plot(c(0,1), c(0,1), type='l', col='red')
  edf(u)
  title(paste('means', mu1, mu2))
  cat(j, k, n1, mu1, sigma1, 'plus', n2, mu2, sigma2, '\n')
}
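# As the preamble notes, sums of c-boxes retain the confidence interpretation:
# each panel's Singh plot should track the diagonal.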
############################################################################################
# SUBTRACTION OF NORMAL MEANS
# this is the Behrens-Fisher problem
# The Wikipedia article on the subject says:
#
# In statistics, the Behrens–Fisher problem, named after Walter Ulrich Behrens and
# Ronald Fisher, is the problem of interval estimation and hypothesis testing concerning
# the difference between the means of two normally distributed populations [and whether
# the means can reasonably be treated as equal] when the variances of the two populations
# are not assumed to be equal, based on two independent samples.
#
# Solutions to the Behrens–Fisher problem have been presented that make use of either
# a classical or a Bayesian inference point of view and either solution would be notionally
# invalid judged from the other point of view. If consideration is restricted to classical
# statistical inference only, it is possible to seek solutions to the inference problem that
# are simple to apply in a practical sense, giving preference to this simplicity over any
# inaccuracy in the corresponding probability statements. Where exactness of the
# significance levels of statistical tests is required, there may be an additional
# requirement that the procedure should make maximum use of the statistical
# information in the dataset. It is well known that an exact test can be gained by
# randomly discarding data from the larger dataset until the sample sizes are equal,
# assembling data in pairs and taking differences, and then using an ordinary t-test
# to test for the mean-difference being zero: clearly this would not be "optimal" in any
# sense.
#
# The task of specifying interval estimates for this problem is one where a frequentist
# approach fails to provide an exact solution, although some approximations are available.
# Standard Bayesian approaches also fail to provide an answer that can be expressed as
# straightforward simple formulae, but modern computational methods of Bayesian
# analysis do allow essentially exact solutions to be found. Thus study of the problem
# can be used to elucidate the differences between the frequentist and Bayesian approaches
# to interval estimation.
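# As an aside (added for illustration), the "exact test" described in the quotation,
# i.e., randomly discarding data from the larger sample, pairing the observations,
# taking differences, and applying an ordinary t-test, can be sketched as follows;
# the sample sizes and parameters are arbitrary.
x = rnorm(7, 0, 3)
y = rnorm(12, 0, 9)
k = min(length(x), length(y))
d = sample(x, k) - sample(y, k) # random pairing discards the extra observations from the larger sample
t.test(d) # ordinary one-sample t-test of whether the mean difference is zero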
f = function(a,b) a-b
cd = function(x) {n = length(x); mean(x) + sd(x) * rt(many,n-1) / sqrt(n)}
edf = function(x) lines(sort(x), 1:length(x)/length(x),type = 's')
some = 2000
many = 2000
n1 = round(runif(1,2,10)) # at least 2, since sd() needs two or more observations
n2 = round(runif(1,2,10))
cat('Sample sizes',n1,n2,'\n')
sigma1 = runif(1,0,10)
sigma2 = runif(1,0,20) # Behrens-Fisher does not assume equal variances
cat('Sigmas',sigma1,sigma2,'\n')
mus = c(0, 1, 5, 10, 50)
m = length(mus)
par(mfrow=c(5,5))
for (j in 1:m) for (k in 1:m) {
  mu1 = mus[[j]]
  mu2 = mus[[k]]
  u = array(0, some)
  for (i in 1:some) {
    x = rnorm(n1, mu1, sigma1) # sample data
    y = rnorm(n2, mu2, sigma2) # sample data
    d = f(cd(x), cd(y))
    u[i] = sum(d < f(mu1,mu2)) / many
  }
  plot(c(0,1), c(0,1), type='l', col='red')
  edf(u)
  title(paste('means', mu1, mu2))
  cat(j, k, n1, mu1, sigma1, 'minus', n2, mu2, sigma2, '\n')
}
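# Note that the Singh plots remain uniform even though the sample sizes and variances
# of the two groups differ, which is the sense in which subtracting normal-mean
# c-boxes solves the Behrens-Fisher problem.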
############################################################################################
# Check that the IMPLEMENTATION of SUBTRACTION in R matches that in RAMAS Risk Calc
##############################
# RUN THIS PART IN R
##############################
f = function(a,b) a-b
cd = function(x) {n = length(x); mean(x) + sd(x) * rt(many,n-1) / sqrt(n)}
edf = function(x) lines(sort(x), 1:length(x)/length(x),type = 's')
some = 2000
many = 20000
n1 = 5
n2 = 10
mu1 = 4
mu2 = 9
sg1 = 3
sg2 = 12
x = rnorm(n1, mu1, sg1) # sample data
y = rnorm(n2, mu2, sg2) # sample data
# overwrite the random draws with fixed data so the results below reproduce the
# Risk Calc session exactly
x = c(3.30171812857776, 6.43717247556227, 5.16202091771867, 0.0882133027828953,
      5.712631788647)
y = c(-19.7414623138356, -24.4788374324584, 8.44049162542419, -13.2825346398374,
      13.1135893503267, 5.39433897900108, -2.20729228848036, 22.5911923579581,
      10.2702860767471, 22.7468609896814)
d = f(cd(x), cd(y))
plot(NULL, xlim=range(d),ylim=c(0,1))
edf(d)
dput(x) # c(3.30171812857776, 6.43717247556227, 5.16202091771867, 0.0882133027828953, 5.712631788647)
dput(y) # c(-19.7414623138356, -24.4788374324584, 8.44049162542419, -13.2825346398374,13.1135893503267, 5.39433897900108, -2.20729228848036, 22.5911923579581,10.2702860767471, 22.7468609896814)
mean(x) # 4.140351
mean(y) # 2.284663
sd(x) # 2.545309
sd(y) # 16.74517
mean(d) # 1.886195
sd(d) # 6.194256
##############################
# RUN THIS PART IN RAMAS RISK CALC
##############################
#tOp = 0.99995
#bOt = 0.00005
#n1 = 5
#n2 = 10
#meanx = 4.140351
#meany = 2.284663
#sdx = 2.545309
#sdy = 16.74517
#x = meanx + sdx * t(n1-1) / sqrt(n1)
#y = meany + sdy * t(n2-1) / sqrt(n2)
#z = x |-| y
#mean(z) // 1.855688
#stddev(z) // 6.216346201
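# The small discrepancies between the R estimates (mean 1.886195, sd 6.194256) and
# the Risk Calc estimates (mean 1.855688, sd 6.216346) are consistent with Monte
# Carlo error in R and the tail truncation (tOp, bOt) used in Risk Calc.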
############################################################################################
# MULTIPLICATION OF NORMAL MEANS
f = function(a,b) a*b
cd = function(x) {n = length(x); mean(x) + sd(x) * rt(many,n-1) / sqrt(n)}
edf = function(x) lines(sort(x), 1:length(x)/length(x),type = 's')
areametric = function(u) sum(abs(sort(u)-((1:length(u))/length(u))))/length(u) # area between the Singh plot and the uniform diagonal
n1 = 8
n2 = 8
sigma1 = 1
sigma2 = 1
some = 2000
many = 2000
# mus = c(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.2, 1.6, 2, 3, 4, 6) # finer grid, if desired
mus = c(0.0, 0.3, 0.6, 0.9, 1.2)
m = length(mus)
A = matrix(Inf, nrow=length(mus),ncol=length(mus))
par(mfrow=c(5,5))
for (j in 1:m) for (k in 1:m) {
  mu1 = mus[[j]]
  mu2 = mus[[k]]
  u = array(0, some)
  for (i in 1:some) {
    x = rnorm(n1, mu1, sigma1) # sample data
    y = rnorm(n2, mu2, sigma2) # sample data
    d = f(cd(x), cd(y))
    u[i] = sum(d < f(mu1,mu2)) / many
  }
  a = areametric(u)
  A[j,k] = a
  plot(c(0,1), c(0,1), type='l', col='red')
  edf(u)
  title(paste('means', mu1, mu2))
  cat(j, ' ', k, ' ', a, ' ', n1, '~N(', mu1, ',', sigma1, ') times ', n2, '~N(', mu2, ',', sigma2, ')\n', sep='')
}
#par(mfrow=c(1,1))
#persp(mus,mus,A,theta=85)
#A
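# Per the preamble, products of c-boxes are valid only with certain limitations; the
# area metric A quantifies each panel's departure from uniformity so that those
# limitations can be located across the grid of true means.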
############################################################################################
# DIVISION OF NORMAL MEANS
f = function(a,b) a/b
cd = function(x) {n = length(x); mean(x) + sd(x) * rt(many,n-1) / sqrt(n)}
edf = function(x) lines(sort(x), 1:length(x)/length(x),type = 's')
areametric = function(u) sum(abs(sort(u)-((1:length(u))/length(u))))/length(u)
some = 2000
many = 2000
m = 5
A = matrix(Inf, nrow=m, ncol=m)
par(mfrow=c(m,m))
for (j in 1:m) for (k in 1:m) {
  n1 = round(runif(1,2,10))
  n2 = round(runif(1,2,10))
  mu1 = signif(runif(1,-15,15),3)
  mu2 = signif(runif(1,-15,15),3)
  sigma1 = signif(runif(1,0,10),3)
  sigma2 = signif(runif(1,0,10),3)
  u = array(0, some)
  for (i in 1:some) {
    x = rnorm(n1, mu1, sigma1) # sample data
    y = rnorm(n2, mu2, sigma2) # sample data
    d = f(cd(x), cd(y))
    u[i] = sum(d < f(mu1,mu2)) / many
  }
  a = areametric(u)
  A[j,k] = a
  plot(c(0,1), c(0,1), type='l', col='red')
  edf(u)
  if (abs(mu2) < sigma2) text(0.2, 0.5, 'PROBLEM')
  title(paste(mu1, sigma1, '/', mu2, sigma2))
  cat(j, ' ', k, ' ', a, ' ', n1, '~N(', mu1, ',', sigma1, ') over ', n2, '~N(', mu2, ',', sigma2, ')\n', sep='')
}
# It seems the only restriction is not to divide by zero. That is, apparently, divisions of
# normal means are fine so long as the true mean of the divisor (denominator) has an
# absolute value larger than its true standard deviation. The distribution of the dividend
# (numerator) can be anything; it can be positive or negative or straddle zero. The divisor
# distribution can be positive or negative, but it cannot straddle zero.
#
# Of course, analysts do not know the true parameter values, so we should devise a test
# based on data to discern whether the result of a division is likely to be reliable. Although
# the basic effect of such a test is obvious, it is not clear how it should be formally designed.
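# One plausible screen, offered only as an assumption-laden sketch, mirrors the
# true-parameter condition |mu2| > sigma2 with its sample analog:
divisor.ok = function(y) abs(mean(y)) > sd(y) # hypothetical reliability screen, not a formally designed test
y = rnorm(10, 1, 5) # a divisor sample whose distribution straddles zero
divisor.ok(y) # likely FALSE, flagging the quotient as unreliable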
#
# When the divisor is near zero and the quotient is negative (because just one of the terms
# is negative), the quotient is likely over-estimated and the Singh plot will lie above the unit
# uniform. When this bad quotient is positive (because the terms share the same sign), the
# quotient is likely under-estimated and the Singh plot will lie below the unit uniform. One
# might think these facts could be wrangled into crude but reliable bounds on the quotient.
# Unfortunately, the nature of the misestimation does not allow this. Whatever the sign of
# the quotient, the estimate is closer to zero than it should be, and there does not seem to
# be a way to glean how much larger in absolute value the estimate ought to be. The
# estimates already have long, fat tails because we are essentially dividing by zero to
# compute them.
#
# The following code snippet illustrates these conclusions. It makes three graphs, which
# display (i) the c-box under-estimating the true quotient m=5, (ii) the would-be imprecise
# c-box obtained by enveloping the first c-box with its negation, and (iii) the natural scale
# that R chooses to display these objects.
m = 5
par(mfrow=c(3,1))
x = rnorm(8, m, 1)   # numerator data, true mean m
y = rnorm(7, 1, 5)   # divisor data whose distribution straddles zero
q = f(cd(x), cd(y))  # quotient c-box (uses f and cd from the division section above)
p = -f(cd(x), cd(y)) # its negation, used to envelope q
plot(NULL, xlim=c(-10,10), ylim=c(0,1))
edf(q)
abline(v=m)
plot(NULL, xlim=c(-10,10), ylim=c(0,1))
edf(q)
edf(p)
abline(v=m)
plot(NULL, xlim=range(c(p,q)), ylim=c(0,1))
edf(q)
edf(p)
abline(v=m)