By the end of this lab, you will be able to:
Construct datasets with categorical features
Visualise binary classification data
Use scikit-learn to train a logistic regression model
Predict outcomes for new samples
Evaluate how predictions relate to the original data
Run this only once in your environment.
pip install scikit-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
We will use the cat dataset again. Data like this can be tricky to visualise, but since there are three features we can make a 3D scatter plot:
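The exact dataset from the earlier lab is not reproduced here, so the sketch below uses a plausible stand-in: three yes/no features (four legs, whiskers, claws) and two positive (cat) examples, matching the description later in this lab. Substitute your own `X` and `y` from the previous exercise.

```python
import numpy as np
import matplotlib.pyplot as plt

# Features: [has_four_legs, has_whiskers, has_claws] (1 = yes, 0 = no)
# Labels:   1 = cat, 0 = not a cat
# Assumed example data -- replace with the dataset from the earlier lab.
X = np.array([
    [1, 1, 1],  # cat
    [1, 1, 0],  # cat
    [1, 0, 0],  # e.g. a dog without whiskers marked
    [0, 0, 0],  # e.g. a snake
    [0, 1, 0],  # e.g. a seal
    [1, 0, 1],  # e.g. a lizard
])
y = np.array([1, 1, 0, 0, 0, 0])

# 3D scatter plot: one axis per feature, points coloured by label
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap="bwr")
ax.set_xlabel("Four legs")
ax.set_ylabel("Whiskers")
ax.set_zlabel("Claws")
plt.show()
```

With only eight possible corners of the feature cube, every sample sits on a corner; colouring by label makes the two cat examples easy to spot.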
model = LogisticRegression()
model.fit(X, y)
Use your model to predict the label for a new input:
# New animal: 4 legs, whiskers, claws
X_new = np.array([[1, 1, 1]])
y_pred = model.predict(X_new)
y_prob = model.predict_proba(X_new)
print(f"Prediction: {'Cat' if y_pred[0] == 1 else 'Not a Cat'}")
print(f"Probability of being a cat: {y_prob[0][1]:.2f}")
Logistic regression doesn’t say "definitely cat" or "definitely not cat". Instead, it computes a linear combination of the features and applies the sigmoid function to turn it into a probability. In the data above there are only two positive examples. How could we improve the accuracy?
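You can verify the sigmoid step yourself by reading the learned weights off the fitted model. A minimal sketch, assuming the same made-up dataset with two cat examples used earlier in this lab:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed training data (two cats), as in the lab
X = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0],
              [0, 0, 0], [0, 1, 0], [1, 0, 1]])
y = np.array([1, 1, 0, 0, 0, 0])

model = LogisticRegression()
model.fit(X, y)

X_new = np.array([[1, 1, 1]])

# Linear combination z = w . x + b, then sigmoid(z) = 1 / (1 + e^(-z))
z = (X_new[0] * model.coef_[0]).sum() + model.intercept_[0]
p = 1 / (1 + np.exp(-z))

print("Hand-computed probability:", p)
print("predict_proba:", model.predict_proba(X_new)[0, 1])
```

The two printed values agree: `predict_proba` is doing exactly this weighted-sum-plus-sigmoid computation internally.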
Modify the Input
Change the new input to different combinations of features and see how the predictions change.
Examples to try:
[0, 1, 1]
[1, 0, 1]
[1, 1, 0]
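A loop makes it easy to compare all three suggested inputs at once. This sketch assumes the same illustrative dataset as before; your own `X` and `y` may give different probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed training data from earlier in the lab
X = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0],
              [0, 0, 0], [0, 1, 0], [1, 0, 1]])
y = np.array([1, 1, 0, 0, 0, 0])
model = LogisticRegression().fit(X, y)

# Try each suggested feature combination
for features in ([0, 1, 1], [1, 0, 1], [1, 1, 0]):
    pred = model.predict([features])[0]
    prob = model.predict_proba([features])[0, 1]
    label = "Cat" if pred == 1 else "Not a Cat"
    print(f"{features} -> {label} (p(cat) = {prob:.2f})")
```

Notice which single feature change flips the prediction: that tells you which feature the model weighs most heavily.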
Add New Training Data
Add more samples to your X and y arrays.
Try to build a more balanced dataset and observe if predictions improve.
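One way to do this is to stack extra rows onto the existing arrays and refit. The extra cat samples below are made up purely for illustration; invent your own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed original data (only two cats -- imbalanced)
X = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0],
              [0, 0, 0], [0, 1, 0], [1, 0, 1]])
y = np.array([1, 1, 0, 0, 0, 0])

# Append extra cat examples to balance the classes (made-up samples)
X_extra = np.array([[1, 1, 1], [1, 1, 0]])
y_extra = np.array([1, 1])
X_balanced = np.vstack([X, X_extra])
y_balanced = np.concatenate([y, y_extra])

model = LogisticRegression().fit(X_balanced, y_balanced)
print("p(cat) for [1, 1, 1]:", model.predict_proba([[1, 1, 1]])[0, 1])
```

Compare the probability before and after balancing: with more positive examples, the model should be more confident about cat-like inputs.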
Predict Multiple Samples
X_test = np.array([
[1, 0, 0],
[0, 1, 1],
[1, 1, 1],
[1, 1, 0],
])
y_test_pred = model.predict(X_test)
print("Predictions:", y_test_pred)
Which ones are predicted as cats?
Do these predictions make sense based on the training data?
Try some of your own datasets
E.g. Is this lunch healthy?
Contains fruit
Contains vegetables
Contains sugary drink
Contains processed snack
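The lunch example follows the same pattern as the cat model: four yes/no features and a 0/1 label. All the samples and labels below are invented for illustration; replace them with lunches you label yourself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [fruit, vegetables, sugary drink, processed snack] (1 = yes)
# Labels:   1 = healthy lunch -- all samples are made up for illustration
X_lunch = np.array([
    [1, 1, 0, 0],  # fruit and vegetables, no junk
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],  # sugary drink and processed snack
    [1, 1, 1, 0],
    [0, 0, 0, 1],
])
y_lunch = np.array([1, 1, 0, 0, 1, 0])

lunch_model = LogisticRegression().fit(X_lunch, y_lunch)

# A lunch with fruit and vegetables but also a sugary drink
print(lunch_model.predict_proba([[1, 1, 1, 0]])[0, 1])
```

The same workflow applies to any topic: pick features, collect labelled examples, fit, and inspect the probabilities.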
Choose a topic you're interested in (sports, food, school, pets, etc.)
Brainstorm 3–5 relevant yes/no questions (these become features)