statistics needs:
a specific question
a measurable answer
describe
summarize
mean
median
mode
range
quartiles
interquartile range (very robust against outliers)
variance
standard deviation
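A minimal Python sketch of these summary measures using the standard-library statistics module (the sample values are made up):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample values

print(statistics.mean(data))      # arithmetic mean
print(statistics.median(data))    # middle value
print(statistics.mode(data))      # most frequent value
print(max(data) - min(data))      # range
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
print(q3 - q1)                    # interquartile range, robust against outliers
print(statistics.variance(data))  # sample variance
print(statistics.stdev(data))     # sample standard deviation
```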
inductive conclusion from the sample about the whole population
sampling
with replacement
independent probability
without replacement
conditional probability
P(A|B) = P(A∩B) / P(B)
without replacement
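A small sketch of the conditional-probability formula P(A|B) = P(A∩B) / P(B), estimated by counting outcomes in an invented sample:

```python
# invented outcomes of the form (event A observed?, event B observed?)
outcomes = [(True, True), (False, True), (True, True),
            (False, False), (True, False), (False, True)]

n = len(outcomes)
p_b = sum(1 for a, b in outcomes if b) / n              # P(B)
p_a_and_b = sum(1 for a, b in outcomes if a and b) / n  # P(A ∩ B)

print(p_a_and_b / p_b)  # P(A|B)
```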
distribution
discrete
continuous
uniform
normal distribution
mean
standard deviation
bimodal etc.
without replacement
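A quick sketch drawing from a continuous uniform and a normal distribution with the standard-library random module (the parameters are arbitrary examples):

```python
import random
import statistics

uniform_sample = [random.uniform(0, 10) for _ in range(10_000)]       # continuous uniform on [0, 10]
normal_sample = [random.gauss(mu=5, sigma=2) for _ in range(10_000)]  # normal with mean 5, std dev 2

print(statistics.mean(normal_sample))   # close to the mean 5
print(statistics.stdev(normal_sample))  # close to the standard deviation 2
```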
binomial
true, false
sequence of independent events
expected value = n * p <=> p = expected value / n
expected value = mean of the probability distribution
the bigger the sample, the closer the sample mean gets to the expected value, i.e. the law of large numbers
the bigger the sample size, the closer the distribution of sample means gets to a normal distribution centred on the mean of the full population, i.e. the central limit theorem
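A sketch of a binomial experiment (n independent true/false trials with success probability p), illustrating that the sample mean approaches the expected value n * p as the sample grows (law of large numbers); the numbers are arbitrary:

```python
import random
import statistics

n, p = 20, 0.3          # 20 independent trials, success probability 0.3
expected_value = n * p  # = 6

def binomial_draw(n, p):
    # one binomial outcome: count of successes in n independent true/false trials
    return sum(random.random() < p for _ in range(n))

for sample_size in (10, 1_000, 100_000):
    sample = [binomial_draw(n, p) for _ in range(sample_size)]
    print(sample_size, statistics.mean(sample))  # approaches 6 as the sample grows
```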
Poisson distribution: probability of a number of events over a fixed period of time
lambda (λ) is the average number of events in the time period, i.e. the expected value
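A minimal sketch of the Poisson probability mass function, where lambda is the expected number of events per period (the value 3 is just an example):

```python
import math

def poisson_pmf(k, lam):
    # probability of exactly k events when lam events are expected per period
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3  # e.g. on average 3 events per hour
for k in range(7):
    print(k, round(poisson_pmf(k, lam), 4))
```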
define target population
state null hypothesis
assume nothing
independent variable
state alternative hypothesis
opposing null hypothesis
dependent variable
collect sample data
the bigger the sample, the closer the sample mean gets to the population mean (law of large numbers), and the closer the distribution of sample means gets to normal (central limit theorem)
test sample statistically
experiment
What is the effect of the treatment on the response?
treatment: independent variable
response: dependent variable
controlled experiment
treatment group vs. non-treatment control group
A/B-testing
avoid bias
randomisation
blinding
double-blinding
draw conclusion about the population from the sample
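A small sketch of randomised assignment for a controlled experiment (A/B test): shuffling subjects into treatment and control groups to avoid selection bias (subject names are hypothetical):

```python
import random

subjects = [f"subject_{i}" for i in range(20)]  # hypothetical participants
random.shuffle(subjects)                        # randomisation to avoid bias

half = len(subjects) // 2
treatment_group = subjects[:half]  # receives the treatment (independent variable)
control_group = subjects[half:]    # non-treatment control group for comparison

print(treatment_group)
print(control_group)
```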
The null hypothesis H₀ claims that the studied effect does not exist: there is no relationship between the two sets of data.
The alternative hypothesis H₁ claims that the studied effect exists: there is a relationship between the two sets of data.
correlation coefficient
What do we know about y when we know x?
values range from -1 to +1
0.99 very strong
0.75 strong
0.50 moderate
0.20 weak
0.00 none
Don't confuse correlation with causation!
confounding variables
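A minimal sketch of the correlation coefficient using statistics.correlation (Python 3.10+); the data pairs are invented:

```python
import statistics

x = [1, 2, 3, 4, 5, 6]        # e.g. hours studied (invented)
y = [52, 55, 61, 64, 70, 74]  # e.g. test score (invented)

r = statistics.correlation(x, y)  # Pearson correlation coefficient, between -1 and +1
print(round(r, 2))                # close to +1: very strong positive relationship
```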
The p-value is the probability of obtaining a result at least as extreme as the observed one, given that the null hypothesis is true.
significance level α = 0.05
p < α 🠖 result is statistically significant
type I error: the null hypothesis is actually true but is falsely rejected, i.e. a false positive
type II error: the null hypothesis is actually false but is falsely accepted as true, i.e. a false negative
changes in the independent variable correlate with changes in the dependent variable
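A sketch of the decision rule, assuming SciPy is available, with a two-sample t-test on made-up measurements:

```python
from scipy import stats

control = [12.1, 11.8, 12.4, 11.9, 12.0, 12.2, 11.7, 12.3]    # made-up control group
treatment = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0, 13.2, 12.6]  # made-up treatment group

alpha = 0.05                                  # significance level α
result = stats.ttest_ind(treatment, control)  # two-sample t-test
print(result.pvalue)

if result.pvalue < alpha:
    print("statistically significant: reject the null hypothesis")
else:
    print("not significant: fail to reject the null hypothesis")
```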
                             | predicted condition pos             | predicted condition neg              | total
actual result condition pos  | true positive TP (1-β)              | false negative FN (type II error, β) | TP + FN
actual result condition neg  | false positive FP (type I error, α) | true negative TN (1-α)               | FP + TN
sensitivity = TP / (TP + FN)
specificity = TN / (FP + TN)
true positive: predicted positive, actual positive
false positive (i.e. type I error): predicted positive, but actual negative -- like a smoke alarm going off without smoke
true negative: predicted negative, actual negative
false negative (i.e. type II error): predicted negative, but actual positive -- like a smoke alarm not going off despite smoke
sensitivity = TP / (TP + FN)
rather flag than not flag
specificity = TN / (FP + TN)
rather not flag than flag
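A small sketch computing sensitivity and specificity from confusion-matrix counts (the counts are invented):

```python
# invented confusion-matrix counts
tp, fn = 80, 20  # actual positives: correctly flagged vs. missed (type II errors)
fp, tn = 10, 90  # actual negatives: falsely flagged (type I errors) vs. correctly cleared

sensitivity = tp / (tp + fn)  # share of actual positives that were flagged
specificity = tn / (fp + tn)  # share of actual negatives that were not flagged

print(sensitivity)  # 0.8
print(specificity)  # 0.9
```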