Part zero: normal data vs. skewed data.
Some 'measures of central tendency'
mean. best for data that is evenly spread out, with minimal outliers
median. better for skewed data, or data that doesn't fit a bell curve
mode. less useful on its own, but it can flag when something is off in a data set (e.g., if the mode sits in a very different place than the mean or median, we are dealing with a strange data set). see the sketch just below this list.
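a quick sketch in R of all three measures on a small made-up vector (the numbers are just for illustration); R has no built-in mode for data, so we tabulate:

x <- c(2, 3, 3, 4, 5, 6, 30)        # small made-up data set with one big outlier
mean(x)                             # pulled upward by the 30
median(x)                           # resistant to the outlier
names(which.max(table(x)))          # most frequent value: the mode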
Describing a normal curve:
N(mu, sigma) -- mu represents the mean of the population, while sigma is the population standard deviation. the standard deviation measures the spread of the data: the larger sigma is, the more spread out our data are.
Good website for this: http://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-+measures+of+spread
heights of males [age 20-29] N(69.3, 2.8) (inches)
heights of females [age 20-29] N(64, 2.7) (inches)
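a sketch of what those two curves look like, drawn with dnorm() (the plotting window of 55 to 80 inches is just a reasonable range I picked, not from the notes):

curve(dnorm(x, mean = 69.3, sd = 2.8), from = 55, to = 80,
      lwd = 2, xlab = "height (inches)", ylab = "density")              # males N(69.3, 2.8)
curve(dnorm(x, mean = 64, sd = 2.7), add = TRUE, lwd = 2, col = "red")  # females N(64, 2.7)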
let's find what's called a z-score. this is a 'standardized score' that tells us how many standard deviations our data point sits from the center.
the 68-95-99.7 rule: about 68% of normal data lands within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3.
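a sketch of both ideas: a z-score for one hypothetical data point (a 72-inch-tall male, a value picked for illustration), and a check of the 68-95-99.7 rule with pnorm():

z <- (72 - 69.3) / 2.8      # about 0.96 standard deviations above the mean
pnorm(1) - pnorm(-1)        # ~0.68 of normal data within 1 sd
pnorm(2) - pnorm(-2)        # ~0.95 within 2 sd
pnorm(3) - pnorm(-3)        # ~0.997 within 3 sd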
salaries of Bennington, VT. salaries are skewed to the right by a small number of very high earners.
here, the median is probably a better indication of a typical salary (see the sketch below).
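a toy illustration (these salary numbers are made up, not actual Bennington data) of why the median is the better summary when the data are skewed:

salaries <- c(28000, 31000, 34000, 36000, 39000, 42000, 45000, 48000, 250000, 900000)
mean(salaries)      # dragged way up by the two big earners
median(salaries)    # closer to what a 'typical' person makes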
Part one: a confidence interval
a confidence interval's job is to give us a RANGE of plausible values for the ACTUAL, true mean.
as you sampled for number of credits, you each ended up with some numbers. some sample means should be high, some should be low, but we each have our own average. our job is to figure out a range for where we think the true mean is.
we need: our sample mean, our sample size, the population standard deviation, and a z-star value. the interval is: sample mean +/- z-star * sigma / sqrt(n). a sketch of that calculation is below.
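here is a sketch of building one such interval by hand; the credit-load numbers and the 'known' population standard deviation are made up for illustration:

credits <- c(16, 14, 18, 12, 16, 15, 14, 17, 16, 13)   # hypothetical sample of credit loads
n      <- length(credits)
x.bar  <- mean(credits)
sigma  <- 2                         # assumed known population standard deviation
z.star <- 1.960                     # for a 95% confidence interval
margin <- z.star * sigma / sqrt(n)
c(x.bar - margin, x.bar + margin)   # our range for the true mean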
in what ways can we manipulate this confidence interval?
Part two: the concept of a p value:
so, a confidence interval covers a range of values.
the true mean is either in this range, or it is not in this range.
increase the range, and we increase the chances it is in there.
but we lose meaning ('the mean is somewhere between 0 and a million' doesn't do much for us).
how sure do we want to be?
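a sketch of that trade-off: qnorm() gives the z-star for each confidence level, and the interval width grows with it (widths here are in units of sigma/sqrt(n)):

conf.level <- c(0.80, 0.90, 0.95, 0.99)
z.star     <- qnorm(1 - (1 - conf.level) / 2)        # about 1.28, 1.64, 1.96, 2.58
data.frame(conf.level, z.star, width = 2 * z.star)   # more sureness = wider interval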
what is a p-score, and what does it do for us?
NOW... we are working backwards. We have a sample, and the population mean is a mystery.
in other words, we have a guess for the population mean, and we are going to use our sample to see how believable that guess is.
take the boxes example. if I sample and get a mean of 4.9, it would seem hard for the true mean to be 100, or even 10.
more sampling could help us with that: more samples, more sureness.
use the code below to toy around with this idea...this is the same thing as a confidence interval, but backwards.
HOW OFTEN IS BEING WRONG OKAY WITH US?
the stats community says we are okay with missing the true mean about 1 in every 20 times.
the 'assumption' of alpha = .05
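a one-line sketch of where the familiar 1.96 comes from, assuming alpha = .05 split across two tails:

alpha <- 0.05
qnorm(1 - alpha / 2)   # ~1.96, the z-star we used for the 95% interval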
how do we find this? we convert a z-score into a p-score (sketched below).
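a sketch of that conversion, reusing the numbers from the boxes example (a sample mean of 4.9, a guessed true mean of 10, samples of size 10); the 'known' standard deviation here is a stand-in value for illustration:

x.bar <- 4.9                              # our sample mean
mu.0  <- 10                               # the guess we are checking
sigma <- 4.7                              # stand-in for a known population sd
n     <- 10
z <- (x.bar - mu.0) / (sigma / sqrt(n))   # how many standard errors the sample sits from the guess
2 * pnorm(-abs(z))                        # two-sided p-score: tiny, so a true mean of 10 looks unlikely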
we will also talk about the t-score, which takes over when we don't know the population standard deviation.
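a sketch of the t version with t.test(), which handles that unknown-sigma case (the sample and the guess mu = 15 are made up):

sample.credits <- c(16, 14, 18, 12, 16, 15, 14, 17, 16, 13)
t.test(sample.credits, mu = 15)   # prints a t-score, a p-value, and a 95% confidence interval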
calculators do most of this work for us, and we are going to let them do it until we no longer have to. remember the key here: we want to be able to assess the work of others, to understand what they do, when, and why.
# the full population of box counts; its mean is the "true mean" we are chasing
boxes<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,12,4,5,12,8,4,4,16,4,5,10,4,5,4,10,16,16,8,6,4,5,9,10,3,12,4,
         10,12,10,6,16,16,8,4,5,18,4,3,9,12,16,3,6,8,4,2,5,18,4,12,4,12,8,3,16,5,9,6,10,3,18,8,10,16,6,15,8,4,
         18,10,4,2,5,8,16,6,9,12,4,9,18,8,8,8)
# we will run 20 tests; this just numbers them 1 through 20 for the y axis
test.number<-c(1:20)
# empty plotting window: the white points are invisible, we only want the axes
plot(boxes,boxes,col="white",pch=16,xlim=c(0,20),ylim=c(0,20),xlab="mean",ylab="test number")
# thick vertical line at the true population mean
abline(v=mean(boxes),lwd=3)
# take 20 samples of size 10 and store each sample mean
sample.boxes<-c()
for(i in 1:20){sample.boxes<-c(sample.boxes,mean(sample(boxes,10)))}
# plot each sample mean at its test number
points(sample.boxes,test.number,pch=19)
# 95% confidence interval around each sample mean, treating sd(boxes) as known
z.star<-1.960
sample.box.low<-sample.boxes-z.star*sd(boxes)/sqrt(10)
sample.box.high<-sample.boxes+z.star*sd(boxes)/sqrt(10)
# draw the intervals; about 19 of every 20 should cross the line at the true mean
segments(sample.box.low,test.number,sample.box.high,test.number,col="red")