Introduction:
In this project, I explore how feature selection, model interpretability, and evaluation choices shape text classification systems. Using email spam detection as a case study, I examine how different words contribute to classification decisions, how adjusting thresholds affects false positives and false negatives, and how ambiguity in labeled data complicates the idea of “ground truth.”
A central theme of this project was interpreting machine learning models rather than simply applying them. Instead of treating the model’s output as definitive, I investigated why certain features mattered, how small changes could flip classifications, and what tradeoffs emerge when models are deployed in real-world contexts like content moderation. This approach highlights why data analysis is not just about improving metrics, but about understanding model behavior and consequences.
Correlation heatmap of word features and the spam label
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 6))
words = ["free", "face", "width", "size", "click", "%", "email", "font", "order", "3d", "href", "url", "guarantee"]
# Build a 0/1 indicator matrix for each word, then attach the spam label
corr_features = words_in_texts(words, train["email"])
word_df = pd.DataFrame(corr_features, columns=words)
word_df["spam"] = train["spam"].values
corr_matrix = word_df.corr()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", annot_kws={"size": 10}, ax=ax)
ax.set_title("Correlation of Words with Each Other and with Spam");
Figure 1. This heatmap displays correlations between selected word features and the spam label, as well as correlations among the words themselves.
The visualization revealed that the word “font” had the strongest correlation with spam emails, while words such as “free” had much weaker correlations. This insight helped guide feature selection by identifying which words were actually informative and which were not, challenging initial assumptions about spam language.
Improving the Model Through Feature Exploration
I tested different word features by incorporating them into exploratory analysis and observing how they affected model performance. This process revealed that many words appeared very rarely in both spam and ham emails, limiting their usefulness as predictive features. I was surprised to find that even commonly associated spam words did not consistently distinguish spam from non-spam. This reinforced the idea that better performance often comes from better features, not more complex models.
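The `words_in_texts` helper that powers this feature exploration is not shown above. A minimal sketch of how such a function might work, assuming it returns a 0/1 indicator matrix with one row per email and one column per word:

```python
import numpy as np
import pandas as pd

def words_in_texts(words, texts):
    """Return a binary matrix: entry (i, j) is 1 if words[j] appears in texts[i]."""
    # texts is a pandas Series of email strings; membership is a plain substring check
    return np.array(
        [texts.str.contains(word, regex=False).astype(int).values for word in words]
    ).T

emails = pd.Series(["click here for a free offer", "meeting agenda attached"])
indicator = words_in_texts(["free", "click", "agenda"], emails)
# indicator -> [[1, 1, 0], [0, 0, 1]]
```

Using `regex=False` keeps characters like `%` from being interpreted as regular-expression syntax.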
ROC Curve and Threshold Tradeoffs
from sklearn.metrics import roc_curve

# Predicted probability of the positive (spam) class
Y_prob = model.predict_proba(X_train)[:, 1]
fpr, tpr, thresholds = roc_curve(Y_train, Y_prob)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], "--", label="x=y")
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Classifier");
Figure 2. The ROC curve illustrates the tradeoff between true positives and false positives across different probability thresholds.
This visualization highlights that classification is not a binary decision but a continuum. Adjusting the threshold allows the model to prioritize catching more spam at the cost of misclassifying legitimate emails, or vice versa. This tradeoff mirrors real-world decision-making in domains like medical screening and content moderation.
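This continuum can be made concrete by sweeping the threshold directly. The sketch below uses synthetic data standing in for the email features (the real `X_train`/`Y_train` are assumed to follow the same conventions); lowering the threshold flags more email as spam, raising false positives while lowering false negatives:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the email features
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] > 0.6).astype(int)

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]

# Sweeping the threshold trades false positives against false negatives
for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)
    fp = int(((preds == 1) & (y == 0)).sum())
    fn = int(((preds == 0) & (y == 1)).sum())
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
```

Since every email flagged at a high threshold is also flagged at a lower one, the false positive count can only rise (and false negatives only fall) as the threshold decreases.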
Interpretability and Ambiguous Labels
I examined how removing a single feature could flip an email’s classification and reflected on how ambiguity in spam/ham labels affects model evaluation.
This analysis showed that simple models can be highly interpretable, making it easier to understand why predictions change. At the same time, it emphasized that labeled data is not always an objective ground truth—different people may reasonably disagree on whether an email is spam. This ambiguity directly impacts how we interpret model performance and fairness.
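The flip described above can be reproduced on a toy example. The features and labels here are invented for illustration: feature 0 carries essentially all of the signal, so once it is dropped, the remaining feature points the other way for this email:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented 0/1 indicator features: feature 0 matches the label perfectly,
# while feature 1 is negatively associated with the spam label
X = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])

full = LogisticRegression(C=100).fit(X, y)
reduced = LogisticRegression(C=100).fit(X[:, 1:], y)  # drop the dominant feature

email = np.array([[1, 1]])
print(full.predict(email))            # classified as spam with both features
print(reduced.predict(email[:, 1:]))  # flips to ham once feature 0 is removed
```

Because the model is linear, the change is fully explainable from the coefficients: removing the dominant feature leaves only a negatively weighted one.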
This project reinforced that building a classification model is only the beginning; understanding why it behaves the way it does is where meaningful insight emerges. Through feature exploration, correlation analysis, and threshold tuning, the model evolved from a simple classifier into a lens for examining how assumptions about language, labels, and evaluation shape outcomes.
One of the most important lessons was how easily intuition can be challenged by data. Words that seemed like obvious spam indicators turned out to be weak signals, while less obvious features carried stronger predictive power. Similarly, small changes—such as removing a single feature or adjusting a probability threshold—were enough to flip classifications, revealing how sensitive models can be and how critical interpretability is in understanding those decisions.
This project also highlighted that data is rarely a perfect reflection of truth. Ambiguity in spam labels showed that disagreement is often inherent in human-generated data, and that models ultimately learn patterns shaped by these subjective decisions. As a result, evaluating performance requires more than optimizing metrics; it requires considering tradeoffs, consequences, and context.
Overall, this analysis reflects why I enjoy data science: curiosity-driven exploration transforms models from black boxes into understandable systems. By asking deeper questions about features, evaluation, and interpretation, data analysis becomes a process of learning—not just about the data, but about the systems and people affected by it.
In this project, I explored how classification models behave when applied to real-world data, focusing not only on predictive performance but also on how errors are distributed across different groups. While metrics such as accuracy can summarize overall performance, they often fail to capture who is helped or harmed by a model’s predictions.
What made this project especially engaging was the opportunity to explore unfamiliar questions through data. Curiosity guided the analysis—from examining model outputs and decision thresholds to investigating how evaluation metrics shape conclusions about fairness. This reflects why I enjoy data analysis: it allows me to continuously learn new concepts while uncovering structure in complex systems.
Through this project, I demonstrate an end-to-end workflow that includes exploratory analysis, model training, evaluation, and interpretation, with an emphasis on understanding model behavior in context rather than relying on a single performance number.
train = train.reset_index(drop=True) # We must do this in order to preserve the ordering of emails to labels for words_in_texts.
plt.figure(figsize=(8, 6))
new_words = ["free", "head", "%", "email", "font", "order"]
# Indicator matrix for the selected words, with the spam label attached
wordsdf = pd.DataFrame(words_in_texts(new_words, train["email"]), columns=new_words)
wordsdf["spam"] = train["spam"]
# Reshape to long format, then compute the proportion of emails containing each word per class
melted = wordsdf.melt(id_vars="spam", var_name="variable", value_name="value")
props = melted.groupby(["spam", "variable"]).mean().reset_index()
props["spam"] = props["spam"].replace({0: "Ham", 1: "Spam"})
sns.barplot(data=props, x="variable", y="value", hue="spam")
plt.ylabel("Proportion of Emails")
plt.xlabel("Words")
plt.title("Frequency of Words in Spam/Ham Emails")
plt.ylim(0, 1)
plt.gca().legend().set_title("")
plt.tight_layout()
plt.show()
Figure 3. Frequency of selected words in spam and ham emails. This bar chart shows how often certain words appear in each class, highlighting differences in language usage between spam and non-spam emails.
# A zero predictor labels every email as ham: it produces no false positives,
# and every spam email in the training set becomes a false negative
zero_predictor_fp = 0
zero_predictor_fn = sum(Y_train == 1)
zero_predictor_fp, zero_predictor_fn
(0, 1918)
Figure 4. Performance summary of a zero predictor model that labels every email as ham.
from sklearn.linear_model import LogisticRegression

my_model = LogisticRegression()
my_model.fit(X_train, Y_train)
training_accuracy = my_model.score(X_train, Y_train)
print("Training Accuracy: ", training_accuracy)

# A zero predictor labels everything as ham, so its accuracy is the proportion of ham emails
zero_predictor_acc = sum(Y_train == 0) / len(Y_train)
zero_predictor_recall = 0
print("Zero Predictor Accuracy: ", zero_predictor_acc)
Training Accuracy:  0.7576201251164648
Zero Predictor Accuracy:  0.7447091707706642
Figure 5. Accuracy comparison between a zero predictor and a logistic regression classifier.
This comparison shows that while logistic regression only slightly improves accuracy over the zero predictor, it is still meaningfully better because it identifies some spam emails correctly.
Y_train_hat = my_model.predict(X_train)
# Components of the confusion matrix
TP = sum((Y_train_hat == 1) & (Y_train == 1))
TN = sum((Y_train_hat == 0) & (Y_train == 0))
FP = sum((Y_train_hat == 1) & (Y_train == 0))
FN = sum((Y_train_hat == 0) & (Y_train == 1))
logistic_predictor_precision = TP / (TP + FP)
logistic_predictor_recall = TP / (TP + FN)
logistic_predictor_fpr = FP / (FP + TN)
print(f"{TP=}, {TN=}, {FP=}, {FN=}")
print(f"{logistic_predictor_precision=:.2f}, {logistic_predictor_recall=:.2f}, {logistic_predictor_fpr=:.2f}")
TP=219, TN=5473, FP=122, FN=1699
logistic_predictor_precision=0.64, logistic_predictor_recall=0.11, logistic_predictor_fpr=0.02
Figure 6. This block evaluates the performance of the logistic regression classifier by computing the components of the confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Using these values, it calculates three key evaluation metrics—precision, recall, and false positive rate.
Accuracy alone can be misleading in spam detection due to class imbalance. Precision measures how often emails predicted as spam are actually spam, while recall captures how effectively the model identifies spam emails. The false positive rate quantifies how often legitimate emails are incorrectly labeled as spam. Together, these metrics provide a more complete picture of model behavior and help evaluate the tradeoffs between blocking spam and avoiding the misclassification of legitimate messages.
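These metrics can also be cross-checked against scikit-learn’s built-in helpers. A small sketch with toy labels (the arrays here are invented for illustration, not the project’s `Y_train`):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels and predictions standing in for Y_train and Y_train_hat
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
fpr = fp / (fp + tn)                         # sklearn has no direct FPR helper
print(precision, recall, fpr)                # 0.666..., 0.5, 0.25
```

Note that `confusion_matrix(...).ravel()` returns the counts in the order (TN, FP, FN, TP) for binary labels.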
Rather than relying on a single accuracy score, this project examines how a classification model behaves in practice by breaking down its predictions and errors. The word-frequency visualization reveals that many commonly assumed “spam indicators” appear infrequently in both spam and ham emails, which limits their usefulness as predictive features. This highlights an important lesson: model performance is often constrained by feature quality rather than the choice of algorithm.
To better understand prediction outcomes, the logistic regression model is evaluated using a confusion matrix and derived metrics such as precision, recall, and false positive rate. Precision helps quantify how often emails flagged as spam are truly spam, while recall captures how effectively the model identifies spam emails at all. The false positive rate provides insight into how often legitimate emails are incorrectly filtered, a key concern for user experience.
Comparing these results with a zero predictor emphasizes why accuracy alone is insufficient. Although the zero predictor achieves relatively high accuracy due to class imbalance, it completely fails to detect spam. The logistic regression model, while only modestly improving accuracy, demonstrates more practical value by correctly identifying some spam emails. Together, these analyses illustrate how thoughtful evaluation and curiosity-driven exploration uncover model limitations and tradeoffs that would otherwise remain hidden.