By the end of this lab, you will be able to:
Apply linear regression using scikit-learn.
Interpret predictions made by a trained regression model.
Before performing any regression, it’s important to understand the data visually.
import numpy as np
import matplotlib.pyplot as plt
x = np.array([2, 4, 6, 8, 10, 12, 14, 16])
y = np.array([1, 3, 5, 7, 9, 11, 13, 15])
plt.figure(figsize=(8, 6))
plt.scatter(x, y, color='blue', label='Data Points')
plt.title('Scatter Plot of Linear Data')
plt.xlabel('x')
plt.ylabel('y')
plt.grid(True)
plt.legend()
plt.show()
Discussion:
This data appears to have a linear relationship: as x increases, y increases at a steady rate. The next step is to model this relationship using linear regression.
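The visual impression can also be checked with a number. The sketch below assumes that the Pearson correlation from NumPy's `np.corrcoef` is an acceptable measure here; the lab itself does not use it.

```python
import numpy as np

# Same data as in the scatter plot above.
x = np.array([2, 4, 6, 8, 10, 12, 14, 16])
y = np.array([1, 3, 5, 7, 9, 11, 13, 15])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```

A value close to 1 indicates a strong positive linear relationship. Because this data lies exactly on a straight line, the result should be 1 up to floating-point rounding.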
Scikit-learn is a popular Python library used for machine learning. It includes tools for training models like linear regression.
Run this only once in your environment.
pip install scikit-learn
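To confirm the installation worked, one option is to import the package and print its version. This is a small optional check, not a required lab step.

```python
import sklearn

# Prints the installed scikit-learn version. An ImportError here
# means the package is not available in this environment.
print(sklearn.__version__)
```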
Import the LinearRegression model from scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
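We are using the same data as above. scikit-learn expects the input features as a 2D array with one row per sample, so x is reshaped into a single column before fitting.
# reshape(-1, 1) turns the 1D array into a column of shape (8, 1)
x = np.array([2, 4, 6, 8, 10, 12, 14, 16]).reshape(-1, 1)
y = np.array([1, 3, 5, 7, 9, 11, 13, 15])
model = LinearRegression()
model.fit(x, y)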
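Use your model to predict a value for a new x. The input to predict must also be a 2D array, which is why x_new is written with double brackets.
x_new = np.array([[4]])
y_pred = model.predict(x_new)
print(f"Predicted y for x = {x_new[0, 0]} is: {y_pred[0]}")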
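To interpret a prediction, it helps to look at the line the model has learned. The sketch below reads the fitted slope and intercept from the `coef_` and `intercept_` attributes and reproduces the prediction by hand. The variable names `slope`, `intercept`, `by_hand`, and `by_model` are chosen only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([2, 4, 6, 8, 10, 12, 14, 16]).reshape(-1, 1)
y = np.array([1, 3, 5, 7, 9, 11, 13, 15])

model = LinearRegression()
model.fit(x, y)

slope = model.coef_[0]        # one coefficient per input feature
intercept = model.intercept_  # predicted y when x = 0

# The prediction is simply intercept + slope * x.
x_value = 4
by_hand = intercept + slope * x_value
by_model = model.predict(np.array([[x_value]]))[0]

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"by hand: {by_hand:.3f}, by model: {by_model:.3f}")
```

For this data every point satisfies y = x - 1, so the slope should be about 1 and the intercept about -1, and both ways of computing the prediction for x = 4 should give about 3.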
Try a new data set. Replace the data with the values below, then re-run the fit and the prediction.
x = np.array([[1], [3], [5], [7], [9], [11], [13], [15]])
y = np.array([2, 5, 7, 10, 12, 14, 17, 20])
Predict Multiple Values
Modify your program to predict y for 3 different x-values (e.g., x = 4, 10, 16) and plot the predicted points using a different marker from the original data points.
Which predictions seem most reliable?
Are any outside the range of your data (extrapolation)?
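One possible starting point for this exercise is sketched below. It predicts several values in one call and flags any that fall outside the observed x range. The name `is_extrapolated` is a helper introduced for this illustration, and plotting is left out; the results could be passed to `plt.scatter` with its `marker` argument.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [3], [5], [7], [9], [11], [13], [15]])
y = np.array([2, 5, 7, 10, 12, 14, 17, 20])

model = LinearRegression().fit(x, y)

# One row per value to predict.
x_new = np.array([[4], [10], [16]])
y_new = model.predict(x_new)

# A prediction is an extrapolation if its x lies outside the
# range of x values the model was trained on.
is_extrapolated = (x_new[:, 0] < x.min()) | (x_new[:, 0] > x.max())

for xv, yv, extrap in zip(x_new[:, 0], y_new, is_extrapolated):
    note = "extrapolation" if extrap else "within data range"
    print(f"x = {int(xv)} -> predicted y = {yv:.2f} ({note})")
```

Here only x = 16 lies beyond the largest training value of 15, so it is the one prediction the data cannot directly support.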
Change the Data
Edit the y values slightly to make the data less perfectly linear.
E.g. y = np.array([2, 5, 6, 9, 13, 15, 16, 19])
Re-run the model and prediction.
Has the prediction changed much?
Does the predicted point still feel "correct" based on the new trend?
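A side-by-side comparison can make the change easier to judge. The sketch below fits one model per data set and reports the prediction at x = 4 together with the R² value returned by `model.score`. Using R² as the measure of fit is an assumption of this example, and `results` is a name introduced only here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [3], [5], [7], [9], [11], [13], [15]])
y_original = np.array([2, 5, 7, 10, 12, 14, 17, 20])
y_edited = np.array([2, 5, 6, 9, 13, 15, 16, 19])

x_new = np.array([[4]])
results = {}

for label, y in [("original", y_original), ("edited", y_edited)]:
    model = LinearRegression().fit(x, y)
    pred = model.predict(x_new)[0]
    r2 = model.score(x, y)  # R^2 on the training data
    results[label] = (pred, r2)
    print(f"{label}: prediction at x = 4 is {pred:.2f}, R^2 = {r2:.3f}")
```

If the two predictions are close, small changes in y have little effect on the fitted line. A lower R² for the edited data reflects the extra scatter around that line.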
Outliers
Add an outlier to your data, like this:
x = np.append(x, [[100]]).reshape(-1, 1)
y = np.append(y, [250])
How does the predicted value change?
What does this tell you about the effect of extreme values on linear regression?
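To see the effect directly, the model can be fitted with and without the outlier and the two results compared. The sketch below is a minimal version, assuming the second data set from earlier in the lab. Names such as `model_clean` and `model_out` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [3], [5], [7], [9], [11], [13], [15]])
y = np.array([2, 5, 7, 10, 12, 14, 17, 20])

# Same data with one extreme point appended.
x_out = np.append(x, [[100]]).reshape(-1, 1)
y_out = np.append(y, [250])

model_clean = LinearRegression().fit(x, y)
model_out = LinearRegression().fit(x_out, y_out)

x_new = np.array([[4]])
pred_clean = model_clean.predict(x_new)[0]
pred_out = model_out.predict(x_new)[0]

print(f"slope without outlier: {model_clean.coef_[0]:.3f}")
print(f"slope with outlier:    {model_out.coef_[0]:.3f}")
print(f"prediction at x = 4 without outlier: {pred_clean:.2f}")
print(f"prediction at x = 4 with outlier:    {pred_out:.2f}")
```

A single far-away point pulls the fitted line toward itself. The slope changes noticeably, and so does the prediction at x = 4, even though that x value sits well inside the original data.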