In this section, we'll introduce some of the major mathematical concepts we'll be applying throughout the course. Oftentimes these concepts seem terrifying to new learners at first. After completing this module, though, we should be able to squash those fears and use Python to calculate some of these functions for us.
We're going to talk about a concept at the heart of calculus: taking the limits of functions. Taking a limit is pretty useful in physics and can help you describe a complex system in a simple way. Typically, we ask two questions:
What if a variable approached a very small value?
What if a variable approached a very large value?
In these limits, we can approximate formulas and simplify expressions. The notation looks like this, where we are taking the limit of some function f(x) as x approaches a value:
Some functions have obvious limits. Others, not so much. Let's start with the ones you can do just by looking at them. In this example, x is approaching a very large number: infinity. When x approaches infinity, this function becomes vanishingly small; one over infinity is basically zero. In cases like this, we can find the limit by substituting the value directly.
The same argument applies to this function. Here we take our limit and see that an exponential raised to the power of negative infinity is zero.
Some other limits may not be so obvious. You would then need to use your mathematical intuition to think about which functions are larger than others, or which grow faster than others. In this example, we see that there is a factor of x in both the numerator and denominator. If we take the limit as x goes to infinity, the x in the denominator will be much greater than one. We can then neglect the +1 and simplify to x over x, which is simply 1. However, if we changed the denominator to 1+x*x, this limit would go to zero, since a number squared grows faster than the number itself.
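As a quick sanity check, we can evaluate these limits symbolically with SymPy (assuming it's available, as it is by default in Google Colab):

```python
import sympy as sp

x = sp.symbols('x')

# lim x->oo of 1/x: one over a huge number is vanishingly small
print(sp.limit(1/x, x, sp.oo))        # 0

# lim x->oo of x/(1 + x): the +1 becomes negligible next to x
print(sp.limit(x/(1 + x), x, sp.oo))  # 1

# lim x->oo of x/(1 + x*x): x squared grows faster than x
print(sp.limit(x/(1 + x*x), x, sp.oo))  # 0
```

The results match the intuition above: the slower-growing terms drop out in the limit.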
In fact, there is a general hierarchy of the rates at which functions grow, which can help you take limits. Here is a useful image (found here) that can help you in the future.
Many times, we do not take the limit as a variable goes to infinity, but instead look at the case when one variable is much smaller than 1. There is a well-known approximation that helps us out here. In this limit, we have a polynomial raised to some power p. We've introduced new notation here: the << symbol translates to "much less than", and the squiggly equal sign means "approximately equal to". We find this limit through a Taylor expansion. In this course, you will not need to know exactly how this works, and we will provide the necessary functions for you.
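We can see how good this approximation is numerically. The sketch below checks (1 + x)**p against 1 + p*x for a few small values of x (using p = 3 as a made-up example power):

```python
# Checking the approximation (1 + x)**p ≈ 1 + p*x when x << 1
p = 3.0
for x in [0.1, 0.01, 0.001]:
    exact = (1 + x)**p
    approx = 1 + p*x
    print(x, exact, approx, abs(exact - approx))
```

Notice how the error shrinks rapidly as x gets smaller: the approximation is excellent once x is much less than 1.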
Another important limit is the small-angle approximation. The sine function, when the angle is very small, can be approximated like this.
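A quick numerical check of the small-angle approximation (angles in radians):

```python
import math

# sin(theta) ≈ theta when theta is small (theta in radians)
for theta in [0.5, 0.1, 0.01]:
    print(theta, math.sin(theta), abs(math.sin(theta) - theta))
```

As with the previous approximation, the smaller the angle, the closer sin(theta) is to theta itself.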
Saying the word 'calculus' nowadays can send shivers down anyone's spine.
To get our conversation on calculus started, read this excerpt from the 1910 textbook Calculus Made Easy. Keep these concepts in mind as we explore the notation of calculus in more depth.
As you can see, calculus isn't too hard to understand. This branch of math helps us to determine two major concepts:
The rate of change of a variable
The summing of functions
Knowing this, let's dive into learning some of the notation.
Calculus deals in the limit of infinitesimal quantities, or really really tiny values. We're interested in how quantities change in an instant, so we evaluate our changes in one variable with respect to another when these changes are really really small. We use the notation on the right to say that we're looking at a really really small change in the variable x. The mathematical name for this quantity is an infinitesimal.
When defining a rate of change, we're looking at how one infinitesimal quantity changes over another. If we're looking at two variables, x and y, and we want to show how x changes with respect to y, then we can write this quantity like what we see on the right. This rate of change is called the derivative of x with respect to y. The derivative can also be interpreted to be the slope of a function at a given point.
A derivative tells us how one variable changes with respect to another variable. You may be familiar with the speed of an object, like how fast a car moves. Well, that's actually a derivative of the car's position with respect to time. We can describe how fast a car is moving in miles per hour or kilometers per hour. This tells you the rate of change of your position. If you're traveling at a constant 60 miles per hour, in one hour you will have traveled 60 miles from your original position.
Of course, life isn't as simple as traveling at a constant speed. There's traffic, speeding up and slowing down, stopping at lights, etc. We can then take another derivative of your speed to find how your speed is changing with respect to time. This quantity is known as acceleration. In terms of calculus, this is a second derivative of the position with respect to time. We write a second derivative like the notation on the right. This tells us the rate of change of the rate of change. Click here for more information on derivatives.
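The position, speed, and acceleration example above can be sketched numerically with NumPy's finite-difference helper `np.gradient` (here for a car moving at a constant 60 mph, a made-up example):

```python
import numpy as np

t = np.linspace(0, 1, 101)     # time in hours
x = 60 * t                     # position in miles: constant 60 mph

speed = np.gradient(x, t)      # first derivative: dx/dt
accel = np.gradient(speed, t)  # second derivative: rate of change of the rate of change

print(speed[50])   # ≈ 60: miles per hour
print(accel[50])   # ≈ 0: constant speed means zero acceleration
```

If the car were speeding up or slowing down, `x` would no longer be a straight line and `accel` would pick up non-zero values.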
Note that sometimes we use notation that's a little different from what we just outlined. Sometimes we want to take the derivative of a function with respect to a length or time. Given a function, f, that has two variables x (a length) and t (time), we have the short-hand notation given on the right to denote a derivative. A spatial derivative is shown as a prime (the ') and a time derivative as a dot. If there is a second derivative being taken, then there are two primes or two dots above the function. The number of primes or dots you see over a function corresponds to the number of derivatives being taken. This notation is much simpler than the earlier one and easier to write, so you'll often see it throughout the course.
We can also combine derivatives to show how multiple variables change with respect to others, which simplifies some of the math we do. Say we have three variables x, y, and z. We know how x changes with y and how y changes with z. From this information, we can actually find out how x changes with z by multiplying two derivatives together, like we do on the right. By swapping the numerators, we can rewrite the derivatives (which behave like fractions here) so that one fraction is dx/dz and the other is dy/dy. How does y change with respect to y? Well, that's just a 1-to-1 relation. That derivative simplifies to 1, and we're left with what we wanted to find: the rate of change of x with respect to z. This relation, known as the chain rule, is very powerful and will help us tremendously in our course.
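We can verify this multiplication trick symbolically. The relations below (x = sin(y), y = z**2) are made up purely for illustration:

```python
import sympy as sp

y, z = sp.symbols('y z')

x_of_y = sp.sin(y)   # suppose x = sin(y)   (made-up example relation)
y_of_z = z**2        # and     y = z**2     (also made up)

dx_dy = sp.diff(x_of_y, y)
dy_dz = sp.diff(y_of_z, z)

# Multiply the two derivatives, then write everything in terms of z
dx_dz = (dx_dy * dy_dz).subs(y, y_of_z)

# Compare against differentiating x(z) = sin(z**2) directly
direct = sp.diff(x_of_y.subs(y, y_of_z), z)
print(sp.simplify(dx_dz - direct))   # 0: the two approaches agree
```

Both routes give the same rate of change of x with respect to z, just as the multiplication of derivatives promises.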
Looking at what's written on the right, you may think you're way over your head. But not to worry! It's not as bad as it seems. The big squiggly 'S' is what's known as an integral, or sometimes as an anti-derivative. By evaluating this function, we are simply summing up the contributions of the function over little slices of x over the range of a to b (a is the lower limit, b is the upper limit). The range of values we sum over is known as the limits of the integral.
A common way to visualize this is to first look at an integral in action by finding the area of a rectangle. You probably know off the top of your head that the area of a rectangle is its base times its height. We can actually evaluate this with calculus! A rectangle has a constant height, h, and we want to find the area over its base, b. We can then take the integral shown here to calculate the area. In effect, what the integral is doing is dividing the area of the rectangle into itty-bitty slices and summing up all those slices to find the total area. Check out this link for more information on integrals and a table of common integrals.
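The slicing-and-summing idea translates directly into code. Here's a sketch using a made-up rectangle of height 3 and base 5:

```python
# Slice a rectangle of height h over base b into thin slices and add them up
h, b = 3.0, 5.0
n = 100_000          # number of itty-bitty slices
dx = b / n           # width of each slice

area = sum(h * dx for _ in range(n))   # each slice contributes h * dx
print(area)          # ≈ 15.0, which is just b * h
```

Summing tiny slices recovers base times height, exactly what the integral does in the continuous limit.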
In the previous sections, we outlined that a derivative is a rate of change and that an integral sums a function. The two concepts we learned are actually related to one another through the fundamental theorem of calculus, which can be written in two ways.
Say you have two functions, g(x) and f(x), both functions of x. If they are related to each other in the way on the right, then it necessarily means that g'(x) = f(x). You can read more about it here.
I think the other way of writing the fundamental theorem of calculus is a little more intuitive and uses a lot of the language we already defined. Say we have a function f(x) with derivative f'(x). If we take the integral of f'(x) from a to b, then the integral is simply the difference between f(b) and f(a). What we are doing in the integral is summing the rate of change of f(x) with respect to x over a range of x. But since we are looking at the rate of change, f'(x), multiplied by the change in x, dx, we can rewrite the integral simply as the change in the function f between two values. Note that the limits of the integral changed when we simplified the expression: we're no longer summing over contributions of dx, but of df(x). The limits then change to the function's values at a and b, corresponding to the limits of the integral we were originally taking in x.
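We can check the fundamental theorem numerically: sum up f'(x) dx over many small slices and compare against f(b) - f(a). The function f(x) = x**2 below is a made-up example:

```python
import numpy as np

f = lambda x: x**2        # made-up example function f(x) = x^2
fprime = lambda x: 2*x    # its derivative, f'(x) = 2x

a, b = 1.0, 4.0
n = 1000
dx = (b - a) / n
midpoints = a + dx * (np.arange(n) + 0.5)   # center of each slice

# Sum the slices: f'(x) * dx from a to b
integral = np.sum(fprime(midpoints) * dx)
print(integral, f(b) - f(a))   # both ≈ 15.0
```

The summed rate of change exactly recovers the total change in f between a and b.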
Here we will apply what we learned above in code! Open your Google Drive folder for this course (that you downloaded here), open the "Intro to Numerical Intregration.ipynb" file in Google Colab, and take a crack at it. If you haven't already, you should look at the Intro to Python notebook and complete that first.
To start our conversation on probability, we as scientists want to ensure that whatever we measure is precise and accurate. You may initially think of these words as interchangeable, but they have distinct meanings.
Accurate measurements recover the true value.
Precise measurements reproduce the same result with great certainty.
A picture is worth a thousand words. Take a look at the figure to the side to solidify the difference between precision and accuracy. Scientists want their measurements to be both accurate and precise!
When we make a lot of measurements of something, all of those data will be centered on the average (also known as the mean), with some scatter about the mean, known as the standard deviation. The 'scatter' tells us how precise our measurements are, while the mean tells us about our data's accuracy. A popular way to describe these kinds of measurements mathematically is through a Probability Density Function (PDF), which tells us how probable a measurement is, using previously recorded data, over the entire range of possible values.
The probability we get from the PDF is a value between 0 and 1. 0 indicates that the new measurement would have been impossible based on the previous data, while 1 indicates that the new measurement would certainly have appeared in the previous data; all the values in between describe different levels of certainty that the new measurement fits the rest of the data. If we measure many different things and want to find the total probability, we multiply the individual probabilities together (we don't sum them!). This ensures that the final total probability still lies between 0 and 1.
The PDF that's most relevant to us in this course is called the Gaussian Distribution, also known as the Normal Distribution. This particular distribution describes the expected symmetric scatter around the mean of a measurement.
Let's take a look at what StatQuest has to say about the Gaussian Distribution! As you watch, pay close attention to how the standard deviation affects the probability of finding heights in a certain range.
From watching the video, we see that we can sum a bunch of probabilities in a given range to find the total probability of finding a height within that range. But this also means that if we summed up the probability for every single possible height, we should find that the total probability is 1. This ties into the definition of a probability density function: Because the PDF is defined as the probability of a measurement over the entire range of possible values, summing up all of the probabilities at every single possible value should indicate that a new measurement will definitely fall in this range.
The formula for the Gaussian Distribution is given here, where μ represents the mean of the distribution, σ the standard deviation of the distribution, and x some new data. σ² is given a special name: the variance. If we are interested in finding the total probability of some new data falling within a given range, we can use our new-found calculus skills to sum up all of the probabilities to get our answer! Some of these important total probabilities are given below:
To interpret these results, look at the upper and lower bounds of the integrals. We are progressively summing up the contribution of more and more probability. Making a measurement within one standard deviation of the mean will occur 68.3% of the time, which also means that making a measurement outside of one standard deviation of the mean will happen 100%-68.3% = 31.7% of the time.
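These total probabilities can be computed directly from the Gaussian's cumulative distribution function, available in SciPy (which comes pre-installed in Google Colab):

```python
from scipy.stats import norm

# Total probability of landing within k standard deviations of the mean
for k in [1, 2, 3]:
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sigma: {p:.3f}")
# within 1 sigma: 0.683
# within 2 sigma: 0.954
# within 3 sigma: 0.997
```

Subtracting the probability within 1σ from 1 gives the 31.7% chance of falling outside one standard deviation, just as above.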
When scientists report results, they often indicate the result equal to μ ± σ, indicating the mean and the standard deviation (or level of error/scatter) of their measurements.
In general, though, we're going to be investigating a bunch of parameters with many measurements. If they're uncorrelated with one another, that is, if changing one parameter doesn't affect another, then we can simply multiply the individual probability distributions together to find the total probability. However, if our measurements are correlated, we need to understand how one parameter affects another. This leads to our definition of the covariance: how changing one parameter affects the others. For organizational purposes, we can store all of our parameters as a list of values (i.e. a vector). But if parameters change one another, we can't describe the distribution with just the variance of each parameter: we also need to consider the cross-correlation, or covariance.
Let's say we have N parameters that we are measuring. A parameter can vary within itself or with any of the other (N-1) parameters. Repeating this for every combination of parameters, we wind up with a matrix that is N rows high and N columns wide, storing these (co)variances in a 2D array. On the right is an example of what the Normal Distribution looks like for two parameters in its general form. Note that the 2D Normal Distribution on the right has the same general form as before. The main difference is that we are replacing σ² with the covariance matrix, Σ.
In the figure, the diagonal of the matrix is shown as gray squares and the off-diagonal values as red squares.
We can see examples of a Gaussian PDF with two co-varying parameters below. The top row of graphs plots the probability of parameters x1 and x2 as the height, whereas the bottom row shows the same distributions viewed from above. The different colors correspond to increasing probability, with yellow being the highest values and purple the lowest.
On the bottom-left, we see the PDF of two parameters that do not affect one another; the only non-zero terms, on the diagonal, represent the variance of each parameter. As we progress to the bottom-right, note how the off-diagonal terms increase. The off-diagonal terms represent the covariance between parameters, meaning that these two parameters affect each other more strongly than on the left.
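We can play with this ourselves by drawing samples from 2D Gaussians with different covariance matrices (the specific matrices below are made-up examples) and recovering the covariance from the samples:

```python
import numpy as np

rng = np.random.default_rng(42)

# Zero, moderate, and strong positive covariance between x1 and x2
for cov in ([[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 0.5], [0.5, 1.0]],
            [[1.0, 0.9], [0.9, 1.0]]):
    samples = rng.multivariate_normal(mean=[0, 0], cov=cov, size=100_000)
    estimated = np.cov(samples.T)   # estimate the covariance matrix from the samples
    print(np.round(estimated, 2))
```

The diagonal entries recover each parameter's variance, and the off-diagonal entries recover the covariance, growing from 0 toward 0.9 just as the plots progress from left to right.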
In the above plots, we showed only positive covariance between x1 and x2. However, covariance can also be negative. Take a look again at the plots above. What do positive covariance values do to the PDF? How would a negative covariance change this?
A positive covariance indicates that when one parameter increases, so does the other. This is most clearly shown in the right-most plots. However, a negative covariance says that when one parameter increases, the other decreases.
In other words, positive covariance means parameters vary with each other, a negative covariance means parameters vary against each other, and a zero covariance means parameters do not vary with each other.
The last topic we'll talk about is related to using a model to predict observations. If you want to make a prediction about something you can't directly observe, then you will need a model to relate what you can observe to what you cannot. In other words, we want to find the conditional probability of observing an event using information about some other event. This is different from independent probabilities, which simply provide the probability of observing an event.
As an example, let's say you want to find a model that can predict weather patterns. Specifically, let's aim to predict whether it will rain tomorrow given the number of consecutive days of sunshine. For simplicity, we'll assume no other weather patterns exist and every day of the year is either rain or sunshine. What we can observe is the number of days of sunshine and rain in the past, but what we're interested in is making a prediction of whether you need to carry an umbrella around with you tomorrow. We can't directly observe tomorrow's weather, so we want to find a model that aims to predict it!
The way we can do this is by using Bayes' Theorem, which relates the conditional and independent probabilities of events to one another. We can write a probability as a function P. A conditional probability can be written mathematically as P(Event 1|Event 2), which gives us the probability of Event 1 given data about Event 2. An independent probability can be written as P(Event 1), which tells us the probability of Event 1 occurring. Using this notation, we can write Bayes' theorem as follows:
For our particular example, we concern ourselves with probabilities of rain and the number of consecutive days of sunshine. We are interested in finding the probability of rain given how much sunshine we've had lately, or P(Rain | number of consecutive days of sunshine). To complete this task, we can find records of the days of rain and sunshine in the past year to find the individual probabilities. However, we still need to know P(number of consecutive days of sunshine | Rain) to complete the prediction. This is where a model is needed in our prediction! We need some theoretically motivated way to describe how many days of sunshine will follow a day of rain.
In our simple example, let's say that it's likely to have sunshine the day after it rains, but very unlikely to have 4 or more days of sunshine after it rains. Assigning the number of days of consecutive sunshine to the variable S, let's create a model that looks like:
Let's also say that we looked at past weather reports, calculated the probability of rain and the probability of S days of sunshine and found the probabilities listed here:
We now have all of the information we need to make a prediction! If we have had 2 days of sunshine, what is the probability it will rain tomorrow?
Looking at our listed functions, P(S=2) = 0.3, P(S=2|Rain) = 0.2, and P(Rain) = 0.1. Bayes' theorem then gives the probability it will rain tomorrow as P(S=2|Rain)·P(Rain)/P(S=2) = (0.2)*(0.1)/(0.3) ≈ 0.067, or about a 6.7% chance of rain!
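Plugging the listed numbers into Bayes' theorem as written above is a one-liner in code:

```python
# P(Rain | S=2) = P(S=2 | Rain) * P(Rain) / P(S=2), using the listed values
p_s2_given_rain = 0.2
p_rain = 0.1
p_s2 = 0.3

p_rain_given_s2 = p_s2_given_rain * p_rain / p_s2
print(round(p_rain_given_s2, 3))   # 0.067
```

Swapping in a different model for P(S|Rain) changes this prediction, which is exactly the point of the next paragraph.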
We are now able to make a prediction using Bayes' Theorem! But what if our prediction is wrong? What if it doesn't rain for another week? In this case, our model of P(S|Rain) is probably incorrect (for example, 4 days of sunshine could be just as likely as 1!).
In this course, we focus mostly on using Bayes' theorem to figure out the values of the parameters in a model using data to constrain them, or P(Model with some parameters | Data). This changes our interpretation of Bayes' theorem slightly. We will have a function for our model, M, which creates a data prediction using parameter p. We will now re-write Bayes' Theorem as follows and provide the common names for the different probability distributions:
The Prior represents your previous knowledge about the parameter values a model can have. The Likelihood represents how likely you are to observe the real data using your model's prediction with a given set of parameter values. The Posterior represents how likely it is that your model, with those parameter values, describes the real data. The Evidence, the probability of observing your given data, is a constant which scales the final posterior probability.
Out of all of these distributions, the posterior is typically what's most interesting to scientists. We won't get into calculating an example just yet. Wait until the end of Unit 2 where we will use these principles on real data!