For this segment, there will be three hands-on exercises. Each exercise should take you about 15-20 mins.
In this exercise, you will get practice with manually plotting error bars. It is important to learn how to manually plot error bars because spreadsheet software in general do not know how to correctly set error bars for scatterplots.
Google Sheets does not allow you to manually set error bars, so you will have to do this in Excel.
First, download the following spreadsheet to your personal device and open it in Excel:
This spreadsheet contains artificial data to simulate data with a spread of random error. You can adjust the parameters in the yellow boxes to see how this changes the trend. Your work will be to fill in the green boxes.
We start with theory. The relationship being modeled is y = bxᵏ + c.
Exercise 1: Assume Δx = 0.1. Obtain an expression for Δy.
Note: If you are more interested in working with random variance, this will be covered in Exercise 4 below
All terms here (y, x, b, k, c) can be assumed to be independent.
For summation, we add the error, so
Δy = Δ(bxᵏ+c)=Δ(bxᵏ)+Δc=Δ(bxᵏ)
since here we will take Δc=0.
For multiplication, we add the percentage error, so
Δy = Δ(bxᵏ)=Δ(b*x*x*...*x)
Δy/y = Δ(bxᵏ)/bxᵏ = Δb/b + Δx/x + Δx/x + ... + Δx/x = Δb/b + k Δx/x
Similarly, we take Δb=0
Therefore,
Δy = Δ(bxᵏ) = (bk xᵏ⁻¹) Δx
Doing this with calculus is much faster, but has the disadvantage of being less familiar to students.
For y = bxᵏ + c,
dy = d(bxᵏ) + d(c) = d(bxᵏ) = b d(xᵏ) = bk xᵏ⁻¹ dx
We therefore come to the same conclusion:
Δy = (bk xᵏ⁻¹) Δx
Exercise 2: With this expression for Δy, use a spreadsheet formula to fill in all green boxes
You should get a result that looks like this:
To calculate the average value of y, we use the formula =AVERAGE(...). The cells are from column B to F, so the appropriate formula for cell G8 is =AVERAGE(B8:F8) as shown below.
If you copy and paste this for the whole column, it should generate all averages automatically, as shown below. The row number should update automatically.
We previously obtained the expression Δy = (bk xᵏ⁻¹) Δx. We will now generate the equivalent in an Excel Spreadsheet formula.
Since the constants b and k are always in the same cell, you should use $B$2 and $B$3 to prevent the cell reference from changing when you copy and paste the formula to other cells.
In contrast, x and Δx need to be updated, so for row n, you can use An and Hn respectively (e.g. A8 and H8 for row 8).
The formula for cell I8 is therefore: =$B$3 * $B$2 * A8^($B$2-1) * H8. If you copy and paste this to the whole column, it should automatically update accordingly. Remember to round off to 1sf. You can do that using =round(...) in the formula. You should get this result:
Exercise 3: Plot the corresponding error bars
This exercise will be guided step by step.
First, select the two columns for x and y_avr and insert a scatterplot (use the circled icon):
You can insert a scatterplot like the one below.
Go to Chart Design -> Add Chart Element -> Error Bars, then pick any of the 3 default options as shown.
We will then customise the error bar appropriately. Rightclick on the error bar and go to Format Error Bars. In this case, let's do x first - click
In this case, let's do error bars for y first. Check that the top says "vertical error bar" as shown, and go to "Specify Value" at the bottom.
Since we are creating error bars for Δy, we select the appropriate column.
Repeat the same process for Δx.
Note: If you are unable to select the error bar directly on the graph, you can go to Chart Tools -> Format and then select the error bars from here.
This is the outcome:
Our theoretical values of uncertainty are very small, compared to the random error in the simulated data - you can tweak the amount of error in cell B5 ("Spread coeff.").
For example, this could be a case where a student takes measurements of the position of a falling object, naively just taking the uncertainty to be the instrumental precision of a ruler, and not realising that the high speed of the object's motion introduces a much larger source of error.
In contrast, the following shows a case where Δx = 0.3, and Δy is calculated accordingly:
Here, it is clear that the line of best fit lies within the range of the error bars; additionally, the simulated data was generated with k=2 and b=3, so the line of best fit with k=1.99 and b=3.17 is fairly close.
Now, go to the other tab - here, we will generate an error bar based on the variance in the data itself.
Exercise 4: Calculate the standard deviation for y in the tab "Using standard dev."
You should obtain something similar to this:
To calculate standard deviation, we use the formula =stdev(...). The cells are from column B to F, so the appropriate formula for cell H8 is =STDEV(B8:F8).
If you copy and paste this for the whole column, it should generate all averages automatically. The row number should update automatically.
Exercise 5: Finally, use this standard deviation to generate error bars.
You should obtain something similar to this:
Bonus exercise:
Plot ALL data points as an additional series and see how it matches up with the generated error bars.
Make a copy of the following spreadsheet in your Google Drive so that you can edit it:
Note:
1. One graph will show "No data" initially. This will be your residual plot later. You can ignore it for now.
2. The orange cells use the =LINEST formula. This is the same mechanism that Excel or Google Sheets uses to plot the line of best fit. By default, =LINEST(values of y, values of x) will give you the gradient and intercept of the line of best fit. Using this, the orange cells will automatically update whenever the artifical data is re-randomised.
Exercise 6:
You should develop a sense for loopholes and errors in analytical methods.
Take a look at the plot of y vs x. Does anything unusual stand out here?
You might notice that data in the middle is below the line of best fit, while data at the start and end are above the line of best fit. In fact, there appears to be a very slight curvature.
Despite the high correlation (R), the linear model may not be valid.
Exercise 7:
This set of artificial data is modelled as y = mx + c.
Using the values of m and c in the orange cells, for each value of x, use a spreadsheet formula to calculate the theoretical value of y in the column "y_expected", then calculate the residual (y - y_expected) in the column "y_residual"
We want to use y = mx+c. Here, m has cell reference $D$2 and c has cell reference $E$2.
Similar to the previous exercise, for cell D7, you should use =$D$2 * B7 + $E$2
To calculate the residual, we take y - y_expected. The values of y are in column C and the values of y_expected are in column D. Therefore, for cell E7, you should use C7 - D7
If you copy and paste this down, you will fill up the two columns.
Once you fill up the residual values, the residual plot should populate. What do you notice?
You can see that the residual plot is a quadratic curve. This is because, in addition to the 'correct' linear relationship, an additional systematic error proportional to x² is introduced.
Therefore, the datapoints are actually y = mx + c + kx² + random error. The automatically-generated line of best fit tries to go between these datapoints and models it as y = m'x + c'.
This results in residuals of (m-m')x + (c-c') + kx² + random error. The residuals are randomly scattered around the quadratic curve.
In general, if your residual plots are not randomly distributed, you might suspect that the proposed model is missing something.
For example, in the earlier exercises for generating error bars, if you tried to insert a line of best fit based on a power or exponential relationship, you will also obtain residuals with a curved relationship since the data follows x to the power of k, plus a constant.
Exercise 8:
Experiment with adjusting the values in the yellow boxes (B1 to B4) and in cell E4. See how this affects both scatterplots.
Make a copy of the following spreadsheet in your drive:
Exercise 9:
For this exercise, use the tab "Manual fitting" in the spreadsheet.
Using an appropriate formula, fill in the column for squared error. This should be filled with the square of the residuals.
Then use an appropriate formula to sum up the squared error in cell E2 ('sum of squares').
The spreadsheet formula for square is ^2 (e.g. =D5^2)
The spreadsheet formula for summing is =SUM(...); e.g. =SUM(E5:E54)
Exercise 10:
Adjust the values in the purple cells (C2 and D2) and see how this affects the sum of squares.
Note that the line of best fit has values of m and c as shown in the scatterplot. Are you confident that these values of m and c have the smallest sum of squares?
Exercise 11:
Now, take a look at the other tab labelled "Grid search".
Grid Search is a technique where a range of values are tested for each parameter, which gives a very quick overview of where the general min and max values are.
The grid provided on this page calculate the sum of squares for the same dataset using a range of values of m and c. The values of m and c are generated using the parameters in the yellow squares (B1 - B4). These are the middle values of m and c in the grid, and the size of the spread (interval between values of m and c for testing).
Try adjusting the mid and interval values of the grid to narrow down the optimal values of m and c that minimises the sum of squares.