Machine learning, particularly techniques like linear regression and neural networks, lies at the heart of modern data science and artificial intelligence. These algorithms empower computers to learn patterns from data and make predictions or decisions based on that learning.
In this tutorial, we will delve into one of the foundational algorithms of machine learning: linear regression. Linear regression enables us to predict a continuous target variable based on one or more input features. To optimize this prediction, we will employ gradient descent, a powerful optimization technique widely used in machine learning.
By the end of this tutorial, you will gain a clear understanding of:
- Implementing a simple linear regression model from scratch in Python.
- Using gradient descent to optimize the model's performance by minimizing the loss function.
- Breaking down the code step-by-step to grasp the purpose and functionality of each component.
- Combining all parts into a cohesive Python script that implements linear regression using gradient descent.
Understanding linear regression and gradient descent is fundamental for anyone entering the field of machine learning or data science. These concepts serve as building blocks for more advanced algorithms and techniques, such as neural networks and deep learning.
In this first part of the tutorial, we focused on simple linear regression. Simple linear regression involves modeling the relationship between a single independent variable (`x`) and a dependent variable (`y`) with a linear equation of the form `y = ax`, where `a` is the parameter (`slope`) we aim to learn. This type of regression is called "simple" because it uses only one feature to make predictions. In contrast, multiple linear regression uses multiple features to predict the dependent variable, making it suitable for more complex datasets.
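To make the distinction concrete, here is a small sketch (with hypothetical numbers, not the tutorial's dataset) contrasting a simple-regression prediction with a multiple-regression prediction:

```python
import numpy as np

# Simple linear regression: one feature, one learned slope
a = 2.0           # hypothetical slope
x_single = 5.0    # a single feature value
y_simple = a * x_single  # y = ax

# Multiple linear regression: several features, one weight per feature
w = np.array([2.0, 0.5, -1.0])       # hypothetical weight vector
x_multi = np.array([5.0, 3.0, 1.0])  # hypothetical feature vector
y_multiple = np.dot(w, x_multi)      # y = w1*x1 + w2*x2 + w3*x3
```

The rest of this tutorial sticks to the simple case, where only the single slope `a` needs to be learned.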
We'll start by showing the entire code, then break it down into individual parts with explanations for each segment.
This code is also available in the Google Colab Notebook.
```python
import numpy as np
import matplotlib.pyplot as plt

# Initialize data
x = np.array([1., 3., 10., 13., 7.])
y = np.array([2., 6., 20., 26., 14.])

# Initialize parameter 'a' randomly
a = np.random.rand()

# Hyper-parameters
learning_rate = 0.001
iterations = 100
N = x.size

# Lists to store loss and parameter 'a' values for plotting
losses = []
a_values = []

# Gradient descent loop
for i in range(iterations):
    # Calculate predictions
    y_pred = a * x

    # Compute loss (Mean Squared Error)
    loss = (1/N) * np.sum((y_pred - y) ** 2)
    losses.append(loss)

    # Compute gradient
    gradient = (2/N) * np.sum(x * (y_pred - y))

    # Update parameter 'a'
    a = a - learning_rate * gradient
    a_values.append(a)

    # Print loss and 'a' every 10 iterations
    if (i+1) % 10 == 0:
        print(f"Iteration {i+1}: Loss = {loss:.4f}, 'a' = {a:.4f}")

# Print the final optimal parameter 'a'
print(f"Optimal parameter 'a': {a:.4f}")

# Plotting the linear regression result
plt.figure(figsize=(14, 5))

# Plot the original data and the regression line
plt.subplot(1, 2, 1)
plt.scatter(x, y, color='blue', label='Data points')
plt.plot(x, a*x, color='red', label=f'Linear regression: y = {a:.4f}x')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Linear Regression')
plt.legend()

# Plot the cost function
plt.subplot(1, 2, 2)
plt.plot(range(iterations), losses, color='purple')
plt.xlabel('Iteration')
plt.ylabel('Cost (MSE)')
plt.title('Cost Function')
plt.tight_layout()
plt.show()
```
```python
import numpy as np
import matplotlib.pyplot as plt
```
This section imports the necessary libraries. `numpy` is used for numerical operations, and `matplotlib.pyplot` is used for plotting graphs.
```python
x = np.array([1., 3., 10., 13., 7.])
y = np.array([2., 6., 20., 26., 14.])
```
In this part of the code, we initialize our data points. Let's break down what each line does and why it's important for our linear regression model.
```python
x = np.array([1., 3., 10., 13., 7.])
```
Here, `x` represents the independent variable (also known as the predictor or feature). It is a numpy array containing the following data points: `1, 3, 10, 13, and 7`. These values could represent any measurable quantities, such as time, distance, temperature, etc., depending on the specific problem we're trying to solve.
```python
y = np.array([2., 6., 20., 26., 14.])
```
Here, `y` represents the dependent variable (also known as the response or target). It is a numpy array containing the following data points: `2, 6, 20, 26, and 14`. These values are the outcomes or responses corresponding to the values in `x`. For each value in `x`, there is a corresponding value in `y`.
The relationship between `x` and `y` is what we aim to model using linear regression. Specifically, we are looking to find a linear function `y=ax` that best fits the data. In other words, we want to find the value of the parameter `a` (`slope`) that minimizes the difference between the predicted values (using the linear model) and the actual values in `y`.
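As an aside (separate from the gradient descent approach this tutorial teaches), for the intercept-free model `y = ax` the MSE-minimizing slope also has a simple closed-form solution, `a = sum(x*y) / sum(x*x)`. A quick sketch with the tutorial's data:

```python
import numpy as np

x = np.array([1., 3., 10., 13., 7.])
y = np.array([2., 6., 20., 26., 14.])

# Closed-form least-squares slope for y = ax (no intercept term)
a_closed = np.sum(x * y) / np.sum(x * x)
print(a_closed)  # 2.0 -- this dataset lies exactly on the line y = 2x
```

Gradient descent should therefore converge toward this same value, which gives us a useful sanity check later.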
Let's say we are dealing with a real-world scenario where `x` represents the number of hours studied by students, and `y` represents their corresponding test scores. The data points can be interpreted as follows:
- A student who studied for `1` hour scored `2` points.
- A student who studied for `3` hours scored `6` points.
- A student who studied for `10` hours scored `20` points.
- A student who studied for `13` hours scored `26` points.
- A student who studied for `7` hours scored `14` points.
Our goal is to use these data points to find a linear relationship between hours studied and test scores, allowing us to predict the test score for any given number of study hours using our model.
By initializing `x` and `y` with these arrays, we set up the foundation for applying linear regression to determine the best-fit line that describes the relationship between the independent and dependent variables.
```python
learning_rate = 0.001
iterations = 100
N = x.size
```
In this part of the code, we set up some important parameters and determine the size of our dataset, which are crucial for the gradient descent algorithm.
```python
learning_rate = 0.001
```
The learning rate is a hyper-parameter that controls the step size during the gradient descent update. It determines how much we adjust our model parameters (in this case, the slope `a`) with respect to the gradient of the loss function. A smaller learning rate means that the model updates the parameters slowly, which can lead to more precise convergence but might require more iterations. Conversely, a larger learning rate can speed up the convergence but risks overshooting the optimal solution, potentially leading to divergence or oscillation.
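To see this trade-off concretely, here is a small sketch (using the tutorial's data and a fixed, hypothetical starting value for `a`) comparing the tutorial's learning rate against a deliberately oversized one:

```python
import numpy as np

x = np.array([1., 3., 10., 13., 7.])
y = np.array([2., 6., 20., 26., 14.])
N = x.size

def run_gd(learning_rate, steps=20, a=0.5):
    """Run a few gradient descent steps and return the final slope."""
    for _ in range(steps):
        gradient = (2/N) * np.sum(x * (a * x - y))
        a = a - learning_rate * gradient
    return a

print(run_gd(0.001))  # moves steadily toward the true slope of 2
print(run_gd(0.02))   # each step overshoots and grows, so 'a' diverges
```

With `learning_rate = 0.02`, each update flips `a` to the other side of the optimum by a larger margin than before, which is exactly the divergence described above.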
```python
iterations = 100
```
The number of iterations specifies how many times the gradient descent algorithm will run. Each iteration involves computing the gradient, updating the parameter `a`, and computing the loss. More iterations allow the model to converge more closely to the optimal solution, but also require more computational time. In this code, the gradient descent loop will run `100` times.
```python
N = x.size
```
Here, `N` is the size of our dataset, which is the number of data points in `x` (and `y`, since they are paired).
`x.size` returns the number of elements in the `x` array.
Since `x` and `y` are paired arrays, `N` represents the total number of data points used in the linear regression model.
For the given data, `x` and `y` both have `5` elements, so `N` will be `5`. The size `N` is used in the calculation of the loss function and the gradient, ensuring that these computations are normalized by the number of data points.
These parameters are essential for controlling the behavior of the gradient descent algorithm and ensuring that it converges to the optimal solution efficiently.
```python
for i in range(iterations):
    # Calculate predictions
    y_pred = a * x

    # Compute loss (Mean Squared Error)
    loss = (1/N) * np.sum((y_pred - y) ** 2)
    losses.append(loss)

    # Compute gradient
    gradient = (2/N) * np.sum(x * (y_pred - y))

    # Update parameter 'a'
    a = a - learning_rate * gradient
    a_values.append(a)

    # Print loss and 'a' every 10 iterations
    if (i+1) % 10 == 0:
        print(f"Iteration {i+1}: Loss = {loss:.4f}, 'a' = {a:.4f}")
```
This is the core part of the code where the gradient descent algorithm is implemented.
```python
y_pred = a * x
```
In this part of the code, we calculate the predictions of our linear regression model. Let's break down what this line does and its significance in the context of linear regression.
In linear regression, we model the relationship between the independent variable `x` and the dependent variable `y` with a linear equation. For simple linear regression, the equation is:
`y = ax` where:
- `y` is the predicted value.
- `a` is the slope (parameter) of the line.
- `x` is the independent variable.
Here, `y_pred` represents the predicted values of `y` based on the current value of the parameter `a` and the input data `x`.
Our goal is to find the optimal value of `a` that best fits the data points in `x` and `y`.
By multiplying `a` with `x`, we compute the predicted `y` values for each data point in `x`.
The line `y_pred = a * x` is crucial because it generates the model's current predictions based on the latest value of the parameter `a`. This step is repeated in each iteration of the gradient descent loop, allowing the model to progressively refine its parameter `a` and improve its predictions by minimizing the loss function.
```python
loss = (1/N) * np.sum((y_pred - y) ** 2)
losses.append(loss)
```
In this part of the code, we calculate the Mean Squared Error (MSE), which is a common metric used to evaluate the performance of regression models. Let's break down what each line does and its significance in the context of linear regression.
The Mean Squared Error is a measure of how close the predicted values (in this case, `y_pred`) are to the actual values (`y`). It is defined as the average of the squared differences between the predicted and actual values.
Squared Error Calculation:
```python
squared_errors = (y_pred - y) ** 2
```
Here, `y_pred` is the array of predicted values, and `y` is the array of actual values. `y_pred - y` computes the differences between each predicted and actual value. Squaring these differences `(y_pred - y) ** 2` ensures that all errors are `positive` and gives more weight to larger errors.
Summation:
```python
sum_squared_errors = np.sum(squared_errors)
```
The `np.sum()` function calculates the sum of all squared errors. This sum represents the total squared error across all data points.
Mean Calculation:
```python
loss = (1/N) * sum_squared_errors
```
To obtain the Mean Squared Error, we divide the sum of squared errors by the number of data points `N`. This normalization step ensures that the MSE is independent of the number of data points, making it easier to compare across different datasets or models.
Adding Loss to List:
```python
losses.append(loss)
```
After computing the MSE for the current iteration, we append the value of loss to the losses list. This allows us to track how the loss changes over each iteration of the gradient descent algorithm.
The lines `loss = (1/N) * np.sum((y_pred - y) ** 2)` and `losses.append(loss)` together compute and store the Mean Squared Error for each iteration of the gradient descent loop. This metric quantifies how well our linear regression model fits the training data and guides the optimization process towards finding the optimal parameter `a`.
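As a quick numeric check, here is how the three MSE steps above work out on the tutorial's data with a hypothetical current parameter value of `a = 1`:

```python
import numpy as np

x = np.array([1., 3., 10., 13., 7.])
y = np.array([2., 6., 20., 26., 14.])
N = x.size

a = 1.0          # hypothetical current parameter value
y_pred = a * x   # [1, 3, 10, 13, 7]

squared_errors = (y_pred - y) ** 2            # [1, 9, 100, 169, 49]
sum_squared_errors = np.sum(squared_errors)   # 328.0
loss = (1/N) * sum_squared_errors             # 65.6
print(loss)
```

A loss of `65.6` for `a = 1` versus a loss of `0` for the perfect slope `a = 2` illustrates how the MSE penalizes a poorly fitting parameter.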
```python
# Compute gradient
gradient = (2/N) * np.sum(x * (y_pred - y))
```
In this part of the code, we calculate the gradient of the loss function with respect to the parameter `a`. Let's break down what this line does and its significance in the context of gradient descent for linear regression.
Gradient descent is an optimization algorithm used to minimize the loss function (in this case, Mean Squared Error, MSE) by iteratively adjusting the model parameters (in this case, `a`).
Error Calculation:
```python
errors = y_pred - y
```
Here, `errors` represents the difference between the predicted values `y_pred` and the actual values `y`. These errors quantify how much our current model is deviating from the actual data.
Weighted Sum of Errors:
```python
weighted_errors = x * errors
```
The expression `x * errors` computes a vector where each element is the product of the corresponding element in `x` and `errors`. This step applies the error to each data point scaled by its corresponding feature value.
Summation and Scaling:
```python
sum_weighted_errors = np.sum(weighted_errors)
gradient = (2/N) * sum_weighted_errors
```
`np.sum(weighted_errors)` calculates the sum of all elements in `weighted_errors`, which effectively computes the dot product of `x` and `errors`.
`(2/N)` scales this sum by a factor of `2/N`, where `N` is the number of data points. This scaling is derived from the derivative of the Mean Squared Error loss function with respect to `a`, ensuring that the gradient points in the direction of steepest descent towards minimizing the loss.
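For reference, the `2/N` factor comes from differentiating the MSE with respect to `a` (a standard calculus step, sketched here):

```latex
L(a) = \frac{1}{N}\sum_{i=1}^{N}\left(a x_i - y_i\right)^2
\qquad
\frac{dL}{da} = \frac{1}{N}\sum_{i=1}^{N} 2\, x_i \left(a x_i - y_i\right)
              = \frac{2}{N}\sum_{i=1}^{N} x_i \left(a x_i - y_i\right)
```

The summation in the code is exactly this derivative, with `a x_i - y_i` playing the role of `y_pred - y`.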
Let's illustrate with a simplified example:
Suppose `x = [1, 2, 3]` and `y = [3, 6, 9]`.
After initializing `a`, suppose `a = 2`. Predicting `y` with `y_pred = a * x` gives `y_pred = [2, 4, 6]`.
Now, calculate the gradient:
- Compute `errors = y_pred - y = [2 - 3, 4 - 6, 6 - 9] = [-1, -2, -3]`.
- Compute `weighted_errors = x * errors = [1 * (-1), 2 * (-2), 3 * (-3)] = [-1, -4, -9]`.
- `sum_weighted_errors = np.sum([-1, -4, -9]) = -14`.
- With `N = 3`, `gradient = (2/3) * (-14) ≈ -9.33`.
The line `gradient = (2/N) * np.sum(x * (y_pred - y))` computes the gradient of the Mean Squared Error loss function with respect to the parameter `a` for linear regression. This gradient guides the iterative process of adjusting `a` to improve the model's fit to the data, ultimately converging towards the optimal solution where the loss is minimized.
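The arithmetic in the example above can be checked directly in numpy:

```python
import numpy as np

x = np.array([1., 2., 3.])
y = np.array([3., 6., 9.])
a = 2.0
N = x.size

y_pred = a * x                 # [2, 4, 6]
errors = y_pred - y            # [-1, -2, -3]
weighted_errors = x * errors   # [-1, -4, -9]
gradient = (2/N) * np.sum(weighted_errors)
print(gradient)  # approximately -9.3333, i.e. (2/3) * (-14)
```

The negative gradient tells us that `a` should be increased, which makes sense: the true slope for this toy data is `3`, above the current guess of `2`.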
```python
a = a - learning_rate * gradient
a_values.append(a)
```
Gradient descent is an optimization algorithm used to minimize a function iteratively by adjusting its parameters. Here, the goal is to find the optimal value of the parameter `a` that minimizes the Mean Squared Error (MSE) between the predicted `y` values (`y_pred`) and the actual `y` values (`y`).
Variables:
- `a`: Represents the current value of the parameter `a` being optimized.
- `learning_rate`: Determines the size of steps taken in the direction of the negative gradient. It's a hyper-parameter that needs to be tuned depending on the problem.
- `gradient`: Denotes the gradient of the loss function with respect to `a`. It indicates the direction and magnitude of the steepest descent.
Update Rule:
```python
a = a - learning_rate * gradient
```
The line `a = a - learning_rate * gradient` implements the update rule for `a` during each iteration of gradient descent:
- `gradient`: Represents the slope of the loss function at the current value of `a`. A positive gradient indicates that increasing `a` would increase the loss, so we adjust `a` downwards.
- `learning_rate * gradient`: Determines the size of the step we take in the opposite direction of the gradient to reduce the loss.
- `a - learning_rate * gradient`: Updates `a` by subtracting the product of `learning_rate` and `gradient` from the current value of `a`. This step moves `a` towards the direction of lower loss.
Appending `a` Values:
```python
a_values.append(a)
```
`a_values.append(a)`: Stores the updated value of `a` in the list `a_values` after each iteration. This list is used later for analysis, such as plotting the convergence of `a` over iterations.
The purpose of these lines of code is to iteratively adjust the parameter `a` to minimize the loss function (MSE in this case) using the gradient descent algorithm. By updating `a` in the direction opposite to the `gradient` scaled by `learning_rate`, the algorithm moves towards the optimal value of `a` that best fits the given data (`x`, `y`).
These lines of code encapsulate the core mechanism of gradient descent optimization, crucial for training linear regression models and many other machine learning algorithms. The iterative adjustment of `a` based on the gradient of the loss function enables the model to progressively improve its fit to the training data.
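A minimal sketch of this mechanism in isolation (using the tutorial's data and a hypothetical starting value for `a`) shows the loss shrinking with each update:

```python
import numpy as np

x = np.array([1., 3., 10., 13., 7.])
y = np.array([2., 6., 20., 26., 14.])
N = x.size
learning_rate = 0.001

a = 0.5  # hypothetical starting value instead of np.random.rand()
losses = []
for _ in range(5):
    y_pred = a * x
    losses.append((1/N) * np.sum((y_pred - y) ** 2))
    gradient = (2/N) * np.sum(x * (y_pred - y))
    a = a - learning_rate * gradient  # step against the gradient

print(losses)  # each loss is smaller than the one before it
```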
```python
if (i+1) % 10 == 0:
```
This line checks whether the current iteration number `i+1` is divisible by `10`, so that the code inside the `if` block runs every `10` iterations.
```python
print(f"Iteration {i+1}: Loss = {loss:.4f}, 'a' = {a:.4f}")
```
Outputs a formatted string that displays:
- `Iteration {i+1}`: Displays the current iteration number (adjusted for human readability by adding `1` to the zero-based index `i`).
- `Loss = {loss:.4f}`: Shows the current value of the loss function (`loss`) formatted to four decimal places.
- `'a' = {a:.4f}`: Displays the current value of the parameter `a` formatted to four decimal places.
```python
print(f"Optimal parameter 'a': {a:.4f}")
```
After completing all iterations, we print the final optimized value of the parameter `a`.
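To close the loop on the studying example, the learned slope can be used to make predictions. Because this dataset happens to lie exactly on `y = 2x`, 100 iterations bring `a` very close to `2`, so a hypothetical student who studies for 5 hours is predicted to score about 10 points:

```python
import numpy as np

x = np.array([1., 3., 10., 13., 7.])
y = np.array([2., 6., 20., 26., 14.])
N = x.size
learning_rate = 0.001

a = np.random.rand()
for _ in range(100):
    gradient = (2/N) * np.sum(x * (a * x - y))
    a = a - learning_rate * gradient

hours = 5.0
predicted_score = a * hours
print(predicted_score)  # close to 10.0, since 'a' converges toward 2
```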
```python
plt.figure(figsize=(14, 5))

# Plot the original data and the regression line
plt.subplot(1, 2, 1)
plt.scatter(x, y, color='blue', label='Data points')
plt.plot(x, a*x, color='red', label=f'Linear regression: y = {a:.4f}x')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Linear Regression')
plt.legend()

# Plot the cost function
plt.subplot(1, 2, 2)
plt.plot(range(iterations), losses, color='purple')
plt.xlabel('Iteration')
plt.ylabel('Cost (MSE)')
plt.title('Cost Function')
plt.tight_layout()
plt.show()
```
This code segment uses Matplotlib, a popular plotting library in Python, to visualize the results of linear regression and the optimization process (gradient descent) through plots.
Implementing linear regression with gradient descent in Python provides a solid foundation for understanding both the mechanics of the algorithm and its practical application in predictive modeling. By following this tutorial, you have taken a significant step towards mastering fundamental concepts in machine learning, setting the stage for exploring more advanced techniques like neural networks.
Published: June 29, 2024