Data

This section is mostly about statistics. Much of it comes from Math is Fun, so be sure to check that out too!

Central values

Mean: The arithmetic mean of all the elements in the data.
Median: Sort the data by value. If the number of elements n is odd, take the ((n+1)/2)th element.
Otherwise, take the mean of the (n/2)th and the (n/2+1)th elements.
Mode/modal: The element that occurs most often. (Yes, that sometimes means that there is no mode, for example when every element occurs exactly once. Sometimes there is more than one.)
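
Here is a small sketch of the three central values using Python's built-in statistics module. The list called data is made up just for illustration. Note that multimode returns every value tied for most frequent, so a list where nothing repeats gives back all of its elements rather than "no mode".

```python
from statistics import mean, median, multimode

data = [3, 7, 7, 2, 9]           # hypothetical values, not from any real data set

data_mean = mean(data)           # (3 + 7 + 7 + 2 + 9) / 5
data_median = median(data)       # middle element of the sorted list [2, 3, 7, 7, 9]
data_modes = multimode(data)     # list of all values tied for most frequent

print(data_mean, data_median, data_modes)

# With an even number of elements, median averages the two middle ones.
even_median = median([1, 2, 3, 4])
```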

Spread

There are many ways to measure spread, and which one is most useful depends on the data.

Range (Statistics): The difference between the largest element and the smallest element.
Quartiles: Cut the sorted data into 4 pieces with 3 dividers: Q_1, Q_2 (which is the median), and Q_3. Depending on n, we have 4 cases:
1. If n mod 4 = 0, then both Q_1 and Q_3 will lie between 2 numbers.
2. If n mod 4 = 1, then both Q_1 and Q_3 will lie between 2 numbers.
3. If n mod 4 = 2, then both Q_1 and Q_3 will lie on 1 number.
4. If n mod 4 = 3, then both Q_1 and Q_3 will lie on 1 number.

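When n is odd it's a bit harder to see, because the median lands on an element and that element is left out of both halves.
Here are examples for n = 5 and n = 7. (For larger n, add one element into each of the 4 gaps around the quartiles to go from n to n + 4.)

(- marks a quartile that falls between 2 elements, and [ ] marks a quartile that lands on an element)
n = 5: a-b [c] d-e
n = 7: a [b] c [d] e [f] g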
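
The quartile rule above can be sketched in Python like this. It assumes the convention used in the 4 cases: when n is odd, the median element is left out of both halves. The function name quartiles is made up for this example, and other tools may use a different convention and give slightly different answers.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3), leaving the median out of both halves when n is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2                    # size of each half (needs n >= 2)
    lower = ordered[:half]           # everything before the median position
    upper = ordered[n - half:]       # everything after the median position
    return median(lower), median(ordered), median(upper)

print(quartiles([1, 2, 3, 4, 5]))          # n mod 4 = 1: Q1 and Q3 fall between 2 numbers
print(quartiles([1, 2, 3, 4, 5, 6, 7]))    # n mod 4 = 3: Q1 and Q3 land on 1 number
```

The interquartile range is then just the third value minus the first.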

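Box and whisker plot: Draw a number line and mark the lowest value, Q_1, Q_2, Q_3, and the highest value. Then draw 2 boxes, one from Q_1 to Q_2 and one from Q_2 to Q_3, and draw whiskers from the boxes out to the lowest and highest values.
Interquartile range: Just take Q_3 - Q_1. It's often a more reliable measure than the range, because a single extreme value doesn't affect it.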
Percentile: The percentage of elements with value below the desired element. If grouped, it's the percentage of everything below the group plus half of the desired group. Can be estimated with a line graph.
- Deciles: groups of 10%
- Quartiles: groups of 25%
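
As a rough sketch, the percentile definition above could be coded like this. Both function names and the example numbers are made up for illustration. The grouped version follows the "everything below the group plus half of the desired group" rule.

```python
def percentile_rank(values, x):
    """Percentage of elements with a value strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

def grouped_percentile_rank(counts, group_index):
    """Percentile of a group: everything below it plus half of the group itself.

    counts is a list of group sizes, ordered from the lowest group to the highest.
    """
    total = sum(counts)
    below = sum(counts[:group_index])
    return 100 * (below + counts[group_index] / 2) / total

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]       # hypothetical data
rank_of_8 = percentile_rank(scores, 8)          # 7 of the 10 elements are below 8

group_sizes = [2, 3, 5]                         # hypothetical grouped data
middle_group = grouped_percentile_rank(group_sizes, 1)   # (2 + 3/2) out of 10
```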

Advanced spread

Mean deviation: The mean of the absolute differences between each element and the mean.
Variance (V): The mean of the squared differences between each element and the mean.
Standard deviation (σ 'sigma'): σ = sqrt(V)
Sample deviation (s): s = sqrt(V*n/(n-1))

Why is the sample deviation slightly different from the standard deviation? When the data is only a sample of the population, the differences are measured from the sample mean, which tends to make the spread look slightly too small. Dividing by n-1 instead of n is a tiny correction for that.

In formulas:

Mean deviation = Σ(|x-μ|)/n
V = Σ((x-μ)^2)/n
σ = sqrt(Σ((x-μ)^2)/n)
s = sqrt(Σ((x-x̄)^2)/(n-1))
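
Here is a minimal sketch of these four formulas in plain Python. The function name spread and the example list are made up for illustration. The same computed mean is used for every formula, so it plays the role of the population mean in the first three and of the sample mean in the last one.

```python
from math import sqrt

def spread(data):
    """Return (mean deviation, variance, standard deviation, sample deviation)."""
    n = len(data)
    mean = sum(data) / n
    squared_sum = sum((x - mean) ** 2 for x in data)

    mean_deviation = sum(abs(x - mean) for x in data) / n
    variance = squared_sum / n                  # divide by n
    sigma = sqrt(variance)
    s = sqrt(squared_sum / (n - 1))             # divide by n - 1 (needs n >= 2)
    return mean_deviation, variance, sigma, s

example = [2, 4, 4, 4, 5, 5, 7, 9]              # hypothetical data with mean 5
mean_deviation, variance, sigma, s = spread(example)
print(mean_deviation, variance, sigma, s)
```

The sample deviation always comes out a little larger than the standard deviation of the same list, because it divides by the smaller number.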

Why am I writing the means in 2 different ways?

Sample mean: x̄ 'x-bar' is the mean of the sample.
Population mean: μ 'mu' is the mean of the entire population.

Comparing data

Univariate data: Just one set of data we can play around with.
Bivariate data: 2 sets of data we have to compare.

So how do we compare bivariate data?

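Correlation: Correlation measures how closely the points on the graph of 2 sets of data follow a straight line.
The more the graph looks like an upward-sloping line such as y = x, the closer the correlation goes to 1, and
the more the graph looks like a downward-sloping line such as y = -x, the closer the correlation goes to -1.
A correlation near 0 means there is no clear straight-line relationship.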

Ready for the formula?

(Pearson's correlation)
r_{x,y} = Σ((x-x̄)(y-ȳ))/sqrt(Σ((x-x̄)^2)Σ((y-ȳ)^2))

That's still not too bad though.

Easier formula for programmers:
r_{x,y} = (nΣ(xy)-ΣxΣy)/(sqrt(nΣ(x^2)-(Σx)^2)sqrt(nΣ(y^2)-(Σy)^2))
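
Here is a sketch of both versions of Pearson's correlation in Python, to show that they agree. The function names and the example lists are made up for illustration. In the second version, the sums under the square roots are sums of squares, Σ(x^2) and Σ(y^2), and everything can be collected in a single pass over the data.

```python
from math import sqrt

def pearson_definition(xs, ys):
    """Pearson's r from the differences to the two means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    top = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    bottom = sqrt(sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys))
    return top / bottom

def pearson_sums(xs, ys):
    """Pearson's r from running sums only (no means needed up front)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    top = n * sum_xy - sum_x * sum_y
    bottom = sqrt(n * sum_xx - sum_x ** 2) * sqrt(n * sum_yy - sum_y ** 2)
    return top / bottom

xs = [1, 2, 3, 4, 5]        # hypothetical paired data
ys = [2, 1, 4, 3, 5]
print(pearson_definition(xs, ys), pearson_sums(xs, ys))
```

Both functions divide by zero if one of the lists is constant, since there is no spread to compare against.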

Whew!

Binomial Distribution

Now this is where it gets hard. (Refer to my probability page first if you're not sure what that even is.)

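The binomial distribution applies when we have a fixed number of independent trials, each trial has exactly 2 mutually exclusive outcomes (like heads or tails), and the probability of each outcome stays the same from trial to trial.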

We consider 2 things:
1. how many outcomes there are with the desired condition; and
2. what the probability of each of them happening is.

So let's utilize coin-tossing to explain all of this.

Problem: What is the probability of getting 2 heads on flipping the same regular no-nonsense coin 3 times?
Solution: Number of outcomes: 3C2 = 3
Probability of each outcome: 0.5^3 = 0.125
Total probability: 3*0.125 = 0.375

That was easy right?

Now look here: a coin has 2 outcomes (H/T), and each outcome is equally likely (50/50). But what if we had a funnily-built coin with H/T = 70/30 or something like that? How do we deal with this?

Let's revisit the problem, but with a different coin:

Problem: What is the probability of getting 2 heads on flipping the same H/T = 70/30 coin 3 times?
Solution: Number of outcomes: 3C2 = 3
Probability of each outcome: 0.7^2 * 0.3^1 = 0.147
Total probability: 3*0.147 = 0.441 (slightly higher!)

So with weighted coins the number of outcomes doesn't change, but the probability of each outcome is different.

Can we generalize this into a single formula? Yes, of course!

Formula construction: Let n be the number of coinflips, and k be the number of heads we want. Then let p be the probability that the coin lands on heads.

Number of outcomes satisfying the condition: nCk = n!/(k!(n-k)!)
Probability of each outcome: p^k * (1-p)^(n-k)

Multiply those together, and we get our formula!

The probability of getting k desired outcomes in n trials:
((p^k)*(1-p)^(n-k))(n!/(k!(n-k)!))

Yay! Now let's try this on a die:

Problem: I have a no-nonsense standard die. What is the probability that I get exactly four 2s in 6 rolls?
Solution: (((1/6)^4)*((5/6)^2))(6C4) = (25/(6^6))(15) ≈ 0.008. Quite small, but just try doing that.
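
The formula can be checked against all three problems above with a few lines of Python. This is just a sketch: the function name binomial_probability is made up, and math.comb (Python 3.8 or later) does the nCk part.

```python
from math import comb

def binomial_probability(n, k, p):
    """Probability of exactly k desired outcomes in n independent trials.

    p is the probability of the desired outcome in a single trial.
    """
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

fair_coin = binomial_probability(3, 2, 0.5)        # 2 heads in 3 flips of a fair coin
weighted_coin = binomial_probability(3, 2, 0.7)    # same, with the H/T = 70/30 coin
four_twos = binomial_probability(6, 4, 1 / 6)      # exactly four 2s in 6 die rolls

print(fair_coin, round(weighted_coin, 3), round(four_twos, 3))
```

A handy sanity check is that the probabilities for k = 0 up to k = n add up to 1.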

And that is the binomial distribution explained.

Normal Distribution

Normally distributed data follows a bell-shaped curve where:
1. the mean = the median = the mode = the center,
2. the curve/histogram/whatever representation is symmetrical about the mean, and
3. 50% of the values are below the mean and 50% of the values are above the mean.

Remember the standard deviation?
- Around 68% of the data is within 1 standard deviation of the mean,
- around 95% of the data is within 2 standard deviations of the mean, and
- around 99.7% of the data is within 3 standard deviations of the mean.

Z-Score: The z-score is a measurement of how many standard deviations an element in the data is away from the mean. It can be calculated as follows:
Z-score for element x with mean μ and standard deviation σ:
z = (x−μ)/σ
Yes, it is that simple, just take the element, take away the mean, and straightaway divide by the standard deviation.

Standardizing a normal distribution: We take away the mean from every value and divide by the standard deviation, and then we have... the standard normal distribution, which has mean 0 and standard deviation 1!

(Wait isn't that how we just got the z-score?
Well, each value is mapped onto its z-score in the standard normal distribution!
So I hope that answers a lot of questions.)
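
In code, the z-score and standardizing look something like this. It's only a sketch: the function names and the example list are made up, and standardize assumes the list is the whole population, so it divides by n when working out the standard deviation.

```python
from math import sqrt

def z_score(x, mu, sigma):
    """How many standard deviations x is away from the mean."""
    return (x - mu) / sigma

def standardize(data):
    """Map every value onto its z-score (population formulas, dividing by n)."""
    n = len(data)
    mu = sum(data) / n
    sigma = sqrt(sum((x - mu) ** 2 for x in data) / n)
    return [z_score(x, mu, sigma) for x in data]

example = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical data with mean 5 and sigma 2
z_values = standardize(example)
print(z_values)
```

After standardizing, the list of z-scores has mean 0 and standard deviation 1, whatever the original units were.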

Now we have a very good question: how do we get the percentage of elements in the data that lies in a given range, measured in standard deviations?

We can work it out for integer z-scores. The normal distribution is symmetrical, so each of the percentages above splits in half around the mean:
- around 34% of the data is between z-scores 0 and 1,
- around 47.5% is between z-scores 0 and 2, and
- around 49.85% is between z-scores 0 and 3.

Add and subtract these and you should get (about) what you want! For example, around 47.5% - 34% = 13.5% of the data is between z-scores 1 and 2.
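
For z-scores that aren't whole numbers, here is a sketch that computes the percentages directly instead of looking them up. It uses the error function erf from Python's math module, which is not covered on this page, so treat it as an alternative to a printed table. The function names are made up for this example.

```python
from math import erf, sqrt

def standard_normal_cdf(z):
    """Fraction of a standard normal distribution that lies below z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def fraction_between(z_low, z_high):
    """Fraction of the data between two z-scores."""
    return standard_normal_cdf(z_high) - standard_normal_cdf(z_low)

for low, high in [(0, 1), (0, 2), (0, 3), (1, 2), (-1, 1)]:
    print(low, high, round(100 * fraction_between(low, high), 2))
```

The results are a little more precise than the rounded percentages above, so expect small differences in the decimals.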

For other z-scores, visit the Math is Fun standard normal distribution table. It's such a good table that I had to print it out for myself to sleep with every night.
