You are going to encounter the normal curve a lot in this course, so it is important that you know how to recognize one, at least on a basic level.
A normal curve looks like the top image: essentially a large hump when the data is represented. Usually the data is shown as a histogram, so you might see that it is made up of bars, like the bottom image.
The average is represented by the dashed line in the middle of the hump. In a perfect normal curve, the mean (average) is also equivalent to the median and the mode. HOWEVER, real data is rarely this perfect. So generally it is best to assume that the center line marks only the mode (the peak), and that it only approximates the average if the curve is nearly perfect (like those shown here), unless you are told otherwise.
https://getnave.com/blog/frequency-distribution-types/
Any time you are looking at a lot of data (a good sample size), you are more likely to see this normal curve (aka normal distribution). That is because there are likely to be many more individuals (or measurements, depending on what the variable is) near the average than at the extremes.
Let's look at a specific example. Think about height. If I took the height of every male in the cafeteria (I am choosing only males to make the data simpler for the example), would most people be extremely tall (over 6 feet, 8 inches)? No, most people would be toward the middle. Some would be taller, some shorter, but most are clustered near that average. The same will typically occur with any variable you are measuring - it could be time spent on homework, income of a town, etc.
https://www.quora.com/How-can-I-tell-if-I-see-a-normal-distribution
No, not quite; there are a lot of commonalities across normal curves, but there can be some differences as well. Firstly, the variable being measured might be different (the variable on the x-axis).
Some normal curves are more spread out, some are narrower. You will be exposed to a wide variety of these curves, but the images on the left give a good basic picture of the different kinds you will encounter.
No, unfortunately, the differences represented by this 'spread' are not just visual; there are numerical differences too. Remember those quantities from statistics or other math classes that you calculated but never necessarily understood? Here I am specifically talking about standard deviation. Standard deviation sounds scary, so I think it deserves its own section.
https://www.statisticshowto.com/pearson-mode-skewness/
Sometimes you can see the shape of the normal curve, but it appears to have an elongated tail. It would look something like this.
Please note: if the data trails off to the right of the hump (the tail is on the right side), we call that right-skewed, because the tail points right. So don't accidentally call it left-skewed just because the hump sits on the left!
Well, just like with anything you are unfamiliar with, it is best to break the words down and think about how they might make sense in the context provided. So the first part of 'standard deviation' is 'standard'... Okay, so maybe this idea can be applied to a lot of data sets. Let's move on to the second word, 'deviation'. Well, if someone deviates from a certain path (like Anakin Skywalker deviating from the light side of the Force... oh sorry, spoilers I guess...), what does that mean? That means he strayed away from that path; he is no longer on that path. So, can data do the same thing? Well let's check it out:
https://blog.minitab.com/en/adventures-in-statistics-2/how-to-interpret-a-regression-model-with-low-r-squared-and-low-p-values
Okay, I see some graphs... Let's break them down. Both are sets of data, and the variables don't really matter for the example. The scales on the axes are very similar, so we don't need to worry about that. Let's just look at the scatterplots. The graph on the left has a line of best fit (that's the red line)... If you don't remember what that is, it is basically the line that is as close as possible to ALL of the data points (the blue dots); it fits the data the best it can. The graph on the right has the same thing, a line of best fit.
What's the difference between these two images, then? Well, you can see a lot more of the line in the image on the left. That's because the data points are way more spread out. That means that the data DEVIATES from that line of best fit a lot. So, the data on the left has a higher standard deviation than the data on the right. The data on the right is more clumped together around that line... each point, on average, deviates far less in that graph from the line of best fit than the data points on the left image.
Imagine you had to measure the distance between all the points and the red line. So you have to go from the red line to each and every data point. On average, you're going to measure a lot larger of distances on the graph on the left. That is a visual for standard deviation.
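That "measure the distance to the line" idea is easy to sketch in code. Here is a minimal Python illustration; the line, the points, and all of the numbers are made up purely to show the contrast between a spread-out and a clumped data set:

```python
def mean_distance_from_line(points, slope, intercept):
    """Average vertical distance from each (x, y) point to the line y = slope*x + intercept."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)

# Hypothetical best-fit line y = 1.0*x + 0.0 for both data sets
spread_out = [(1, 3), (2, 0), (3, 6), (4, 1), (5, 8)]         # points far from the line
clumped = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 4.0), (5, 4.8)]  # points hugging the line

print(mean_distance_from_line(spread_out, 1.0, 0.0))  # 2.6  -> big average deviation
print(mean_distance_from_line(clumped, 1.0, 0.0))     # ~0.12 -> small average deviation
```

The same line fits both data sets, but the average distance is more than twenty times larger for the spread-out points. That contrast is exactly what the two graphs are showing.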
Well, if you have a higher standard deviation, you probably will have less confidence in your data. If the data seems to jump all over the place, I don't want to rely on the 'average' that is the line of best fit, because I might be very, very wrong. But if most of the data sticks close to that line, I'm less worried.
https://apcentral.collegeboard.org/pdf/ap-biology-equations-and-formulas-sheet.pdf
I'm glad you asked; there is an equation, and it is even provided to you on the AP exam and all of my exams. It is written as s = √[ Σ(xᵢ − x̄)² / (n − 1) ], where:
s is the standard deviation
∑ means 'sum of', so you have to calculate (xᵢ − x̄)² for each data point and then add them up
xᵢ is each individual data point, so you have to do that calculation 5 times if there are 5 data points
x̄ is the average of the data
n is the sample size (how many data points are there?)
Well, if there are more data points, I can be more confident in my estimates. More data is always better - I'd rather have data I can be more confident in, even if it doesn't support my initial hypothesis. You can think about the equation mathematically as well. A bigger sample size n makes the denominator larger, but each extra data point also adds one more term to the sum on top, so s itself just settles toward a stable value; it is the standard ERROR (coming up next) that reliably shrinks as n grows. Never look at an equation and just plug and chug. Pay attention to what the variables mean. They are where they are for a reason!
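If you want to check a hand calculation, the formula translates directly into a few lines of Python. The data below is made up just for illustration; `statistics.stdev` from the standard library uses the same n − 1 formula, so it should agree:

```python
import math
import statistics

def sample_sd(data):
    """s = sqrt( sum((x_i - x_bar)^2) / (n - 1) ) -- the formula from the AP sheet."""
    n = len(data)
    x_bar = sum(data) / n                               # the average, x̄
    squared_devs = sum((x - x_bar) ** 2 for x in data)  # Σ(xᵢ - x̄)²
    return math.sqrt(squared_devs / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data points
print(sample_sd(data))            # ≈ 2.138
print(statistics.stdev(data))     # same answer from the standard library
```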
Standard error of the mean is kind of similar to standard deviation in that both can represent how confident we are in the data. Basically, the standard error of the mean estimates the variability between sample means that you would get if you took several samples from the same population. The standard error of the mean can be represented visually on a graph with standard error bars. The size of the standard error bars says a lot: if I have lots of data in my sample, I will feel more confident, and thus the standard error bars will be smaller.
Standard error bars will look a few different ways depending on the data, but you will often see them in scatterplots (left) and bar graphs (right). Basically, the point (on the scatterplot) or the top of the bar (on the bar graph) represent the data that was collected, but the standard error bars basically show a range of data that would be unsurprising to see in the circumstances. So the smaller the error bars, the more confident we are in the data represented by the scatterplot or bar graph.
https://www.wolfram.com/language/12/core-visualization/error-bars-and-fences.html?product=mathematica
https://superuser.com/questions/751266/add-custom-error-bars-to-multiple-series-in-one-graph
https://apcentral.collegeboard.org/pdf/ap-biology-equations-and-formulas-sheet.pdf
Good news! It's another equation that is provided to you on the exam: SE(x̄) = s / √n.
Wait, is that the same s from before?
Yup. You need that standard deviation before you can calculate standard error of the mean. Remember when I said that standard error and standard deviation are kind of similar? Well, this equation is precisely why. The bigger the s, the bigger the standard error.
Now that you have your standard error of the mean (SEM), multiply it by 2, then add that to and subtract that from the mean [mean ± 2×SEM]. Those two values are where you mark the ends of your standard error bars.
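Putting the two equations together, here is a short Python sketch (with made-up data) that computes the mean, the standard error of the mean, and the two values where the error bars would go:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data points
mean = statistics.mean(data)
s = statistics.stdev(data)       # sample standard deviation (n - 1 formula)
sem = s / math.sqrt(len(data))   # SE = s / sqrt(n)

lower = mean - 2 * sem           # bottom of the error bar
upper = mean + 2 * sem           # top of the error bar
print(round(lower, 2), round(upper, 2))  # 3.49 6.51 -> the range the error bars cover
```

Notice that a bigger sample (larger n) or a smaller spread (smaller s) would both shrink `sem`, pulling the two bar ends closer to the mean.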
Okay, this is the last real statistics thing you have to know for AP Biology, I promise! A Chi-Square test is basically a way for us to determine how accurate (or inaccurate) our predictions were. So a bag of M&Ms has 6 colors: blue, orange, green, yellow, red, and brown. If the factory claims that they make equal amounts of every color (they don't claim this, but let's pretend), you would expect each color to make up roughly 1/6, or 16.67% of the total number of M&Ms made. This is our assumption or estimate; this is what we would expect. We would call this our null hypothesis... it's kind of like our standard, boring option. This is the hypothesis that we will assume is correct unless we can disprove it.
A Chi-Square test can be used to test whether or not this is actually a decent claim (hypothesis). So we need a sample... Let's just buy a bag of M&Ms and count up how many there are of each color. As long as it's a big enough bag (large enough sample), we should have some confidence in our findings! The values that we count ourselves are our observed values.
The equation is χ² = Σ (o − e)² / e. But it is much more helpful to break it down in a chart, as you will see in a minute.
https://apcentral.collegeboard.org/pdf/ap-biology-equations-and-formulas-sheet.pdf
Okay so let's pretend that I did buy a bag of M&M's, and I found the following numbers:
This kind of table is SO USEFUL for this test, and you'll see why. It looks scary, but just take it piece by piece.
So I have my observed (o) values from counting. Now I need to figure out my expected (e) values... Well, what did I say we would expect to see based on our assumption that the colors are equally distributed? I would expect each color to represent 1/6 of the total. So I can just put 16.67% as my expected, right?? WRONG.
You cannot put a percentage in the 'expected' row. Remember, we're working with numbers of M&Ms in this table. So how many brown M&Ms would I expect to see IF they represent 1/6 of the total made? Well, it would depend on how big the bag is. So, I need to make sure I am looking at the same sample size as my actual data. So if I have a bag with 180 M&Ms in it, and it's a perfect world where every color is equally represented, I would expect 30 of each color. So now I can fill in my 'expected' row. It is all the same for every color because we are testing whether or not they are equally distributed. Sometimes, you'll work under different assumptions, and we'll test that in class. Now our table looks like:
Now all I have to do is go down the columns and fill in each row. So the first row says 'Difference (o-e)'. This means that I have to calculate o-e for that column. Then I can move on down the rest of the column. Eventually, the first column will look like:
Now I need to do this for all the columns. It will look like:
Now remember that ∑ from the equation? That means we need to find a sum. We need to add up that bottom row. That will finally give us our Chi-Square (χ2) value. So add it up and you get 17.98, and it enters the table:
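The whole table boils down to a few lines of arithmetic. The observed counts below are hypothetical stand-ins (the real counts live in the table above; I just picked numbers that sum to 180), but the calculation is exactly the one the table walks through:

```python
observed = [45, 15, 36, 24, 33, 27]  # hypothetical counts, one per color (sums to 180)
expected = [sum(observed) / 6] * 6   # equal distribution: 180/6 = 30 of each color

# χ² = Σ (o - e)² / e : one term per color, then add them all up
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_square)                    # ≈ 18.0 for these made-up counts
```

Each term in the sum is one cell of the bottom row of the table; the final `sum(...)` is the ∑ step.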
This is, indeed, our Chi-Square value. But, we have one final step, and it is the most often-forgotten step. You've done all of this math, you're tired. Don't make it all for nothing by getting the answer wrong now.
Our last step is to take that Chi-Square value and compare it to what is known as the critical value. Basically, you are given a table, and that table tells you, based on your data, what value of Chi-Square will be significant. The table that you are provided, the critical value table, looks like:
Here's the easy part. First, we need to look at the p-value rows of the chart. A p-value is tricky to understand, but it basically just answers the question: 'if there was really zero effect, how likely is it that you would still get results like the ones you have?' (Science Fictions by Stuart Ritchie). Roughly speaking, it measures how surprising our results would be if the null hypothesis were true. The lower the p-value threshold we use, the less likely we are to be fooled by a false positive, and the more confident we can feel in the outcome of the statistical test.
Almost all scientists rely on a p-value threshold of 0.05, so you only need to worry about that top row. You can totally ignore the bottom row (UNLESS A QUESTION SPECIFICALLY TELLS YOU TO USE THE VALUE OF 0.01, but that is rare). Some fields, such as medicine, may use a stricter (lower) threshold. A 5% chance of a false positive may sound okay when researching mating in crickets, but it is a whole different story when you are testing the efficacy of a medication in humans, for instance.
Now we need to know our degrees of freedom. Don't worry too much about what that means or why it's called that for AP Biology; it is interesting and can be useful, but I think it would be too much to add on here. Basically, in a Chi-Square test like this, the degrees of freedom are easy to calculate from the number of categories we were looking at. We were counting how many M&Ms there were of each color, so how many categories could an M&M be placed in? Well, there were 6 colors, so there are 6 categories. BUT WAIT. That doesn't mean our degrees of freedom value is 6. We have to take one away from that: df = 6 − 1 = 5.
Why did we take one away? Well, basically, if an M&M wasn't red, blue, yellow, green, or brown, there is no option other than orange. So there are really, in a way, only 5 free choices. Don't worry about this unless you take AP Statistics!
That's all well and good: I now know I have a critical value of 11.07 from the chart (remember, I'm in the top row because p = 0.05, and I'm in the 5th column because df = 5). Now, all I need to do is compare my Chi-Square (χ²) to that critical value. My Chi-Square value was 17.98. The critical value is 11.07. My Chi-Square (χ²) is larger than the critical value.
As a result, we state that 'we reject the null hypothesis'. Basically, our Chi-Square value shows that the counts of each color were waaaaaay off from what we expected. The colors are not equally distributed; each color does not represent 1/6 of the total. So we reject that hypothesis, and instead we support the alternative hypothesis: 'The M&M colors are not evenly distributed.'
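If you happen to have SciPy available, you can pull the critical value out of code instead of the printed table (`scipy.stats.chi2.ppf` is the inverse of the Chi-Square distribution), then re-check the comparison we just did by hand:

```python
from scipy.stats import chi2

p = 0.05            # significance threshold (top row of the table)
df = 5              # degrees of freedom: 6 colors - 1
chi_square = 17.98  # the value we calculated from the M&M table

critical_value = chi2.ppf(1 - p, df)  # ≈ 11.07, matching the printed table
if chi_square > critical_value:
    print("Reject the null hypothesis")  # the colors are not equally distributed
else:
    print("Fail to reject the null hypothesis")
```

This is only a cross-check; on the AP exam you will be handed the critical value table and expected to read it the way we did above.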