NB: All the datasets and analyses used in this project are available on GitHub. You can access them by clicking the GitHub icon at the end.
This project applies interpretable machine learning techniques to the Breast Cancer Wisconsin (Diagnostic) dataset with the aim of identifying the most informative tumor characteristics associated with malignancy. Rather than focusing solely on predictive accuracy, the analysis emphasizes feature interpretability and clinical relevance, allowing insights to be directly aligned with established medical diagnostic practices.
The dataset consists of quantitative measurements derived from digitized images of breast tumor cell nuclei, capturing tumor size, texture, and morphological features. A logistic regression model was trained to distinguish malignant tumors from benign ones, and feature importance analysis was conducted to determine which measurements most strongly influence classification outcomes.
The key objective of the project is to evaluate whether localized tumor abnormalities, particularly shape irregularities at the tumor’s most extreme regions, provide stronger diagnostic signals than average tumor measurements. The findings demonstrate that worst-case shape descriptors, such as concavity, concave points, and asymmetry, play a dominant role in malignancy prediction, reinforcing their clinical significance in breast cancer diagnosis.
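The analysis was conducted using the Breast Cancer Wisconsin (Diagnostic) dataset, which contains 569 tumor samples derived from digitized images of fine needle aspirates of breast masses. Each sample includes 30 numerical features describing tumor morphology, size, texture, and boundary characteristics. These are ten measurements of the cell nuclei (such as radius, texture, concavity, and symmetry), each summarized in three ways: as a mean, as a standard error, and as a "worst" value, defined in the dataset as the mean of the three largest values observed in the image.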
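The structure described above can be checked directly. The following is a minimal sketch that assumes the copy of the dataset bundled with scikit-learn (`load_breast_cancer`), which may differ in column naming and label encoding from the files in the project repository; the `groups` dictionary is a hypothetical helper introduced only for illustration.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy of the dataset, not the project's own files.
data = load_breast_cancer()
X, y = data.data, data.target  # in this copy, 0 = malignant and 1 = benign
feature_names = [str(n) for n in data.feature_names]

print("samples, features:", X.shape)
print("malignant:", int((y == 0).sum()), "benign:", int((y == 1).sum()))

# Split the 30 columns by how each nuclear measurement was summarized.
groups = {
    "mean": [n for n in feature_names if n.startswith("mean ")],
    "error": [n for n in feature_names if n.endswith(" error")],
    "worst": [n for n in feature_names if n.startswith("worst ")],
}
for label, cols in groups.items():
    print(label, len(cols), cols[:3])
```

Each group should contain the same ten measurements, which is what makes a like-for-like comparison between mean and worst-case features possible later on.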
A Logistic Regression model was selected due to its interpretability and suitability for binary classification tasks in clinical contexts. Unlike more complex black-box models, logistic regression allows direct examination of feature coefficients, making it possible to understand how individual tumor characteristics influence malignancy prediction.
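The model was trained on the full feature set, and feature importance was evaluated using the magnitudes of the standardized coefficients, which put features measured on different scales (for example, area versus smoothness) on a common footing.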
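One way to reproduce this kind of ranking is sketched below. It assumes scikit-learn's bundled copy of the dataset, z-score scaling with `StandardScaler`, and scikit-learn's default L2 regularization; the project's actual preprocessing and hyperparameters are not stated here, so the resulting order may differ from the one reported.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data
# scikit-learn encodes malignant as 0; recode so that 1 = malignant and
# positive coefficients push the prediction towards malignancy.
y_malignant = (data.target == 0).astype(int)

# Standardize every feature, then fit the classifier on all 30 features.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y_malignant)

# Rank features by absolute coefficient size (largest first).
coefs = model.named_steps["logisticregression"].coef_.ravel()
order = np.argsort(-np.abs(coefs))
ranking = [(str(data.feature_names[i]), float(coefs[i])) for i in order]

for name, coef in ranking[:10]:
    print(f"{name:25s} {coef:+.3f}")
```

Because many of the 30 features are strongly correlated (radius, perimeter, and area in particular), individual coefficients can shift with the regularization strength, so a ranking like this is best read as indicative rather than exact.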
Analysis of model coefficients revealed that the most influential predictors of malignancy were worst-case morphological features, particularly those describing tumor shape irregularity. These included:
Concave points (worst)
Concavity (worst)
Symmetry (worst)
Radius (worst)
Texture (worst)
In contrast, the mean-value features consistently showed lower predictive influence.
This suggests that localized extreme abnormalities carry greater diagnostic significance than global or average tumor characteristics.
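The comparison between worst-case and mean features can be summarized per group. This is a rough sketch under the same assumptions as before (scikit-learn's bundled dataset, standardized inputs, default regularization); `summary_type` and `group_importance` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
names = [str(n) for n in data.feature_names]
y_malignant = (data.target == 0).astype(int)  # 1 = malignant

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(data.data, y_malignant)
abs_coefs = np.abs(model.named_steps["logisticregression"].coef_.ravel())


def summary_type(name):
    """Classify a column as a 'mean', 'worst', or 'error' summary."""
    if name.startswith("mean "):
        return "mean"
    if name.startswith("worst "):
        return "worst"
    return "error"


# Average absolute standardized coefficient within each summary type.
group_importance = {}
for group in ("mean", "error", "worst"):
    mask = np.array([summary_type(n) == group for n in names])
    group_importance[group] = float(abs_coefs[mask].mean())

for group, value in group_importance.items():
    print(f"{group:5s} average |coefficient| = {value:.3f}")
```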
Tumors classified as malignant exhibited markedly higher values for concavity, concave points, and asymmetry. These features reflect jagged, spiculated, and uneven tumor boundaries, which are commonly associated with aggressive tumor behavior.
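A class-wise comparison of these shape features might look like the sketch below. It assumes scikit-learn's bundled dataset and groups samples by their recorded diagnosis rather than by model output; the one-sided Mann-Whitney U test is a choice made for this illustration, not necessarily the method used in the project.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
names = [str(n) for n in data.feature_names]
malignant = data.target == 0  # boolean mask; 0 = malignant in this copy

shape_features = ["worst concavity", "worst concave points", "worst symmetry"]
comparison = {}
for feature in shape_features:
    values = data.data[:, names.index(feature)]
    # alternative="greater" tests whether malignant values tend to be larger.
    test = mannwhitneyu(values[malignant], values[~malignant], alternative="greater")
    comparison[feature] = {
        "malignant_median": float(np.median(values[malignant])),
        "benign_median": float(np.median(values[~malignant])),
        "p_value": float(test.pvalue),
    }

for feature, stats in comparison.items():
    print(feature, stats)
```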
Worst-case size features, such as radius and area at the most extreme region, were more informative than average size measures. This suggests that malignant tumors may not be uniformly large, but often contain focal regions of abnormal growth that dominate diagnostic outcomes.
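One simple way to compare how informative each size measure is on its own is the single-feature ROC AUC, sketched below. This assumes scikit-learn's bundled dataset and is a univariate check, so it complements rather than reproduces the coefficient-based analysis; the `auc` dictionary is a hypothetical name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score

data = load_breast_cancer()
names = [str(n) for n in data.feature_names]
y_malignant = (data.target == 0).astype(int)  # 1 = malignant

# Use each raw feature as a score: larger values are treated as more suspicious.
auc = {}
for base in ("radius", "area"):
    for summary in ("mean", "worst"):
        feature = f"{summary} {base}"
        scores = data.data[:, names.index(feature)]
        auc[feature] = float(roc_auc_score(y_malignant, scores))

for feature, value in auc.items():
    print(f"{feature:13s} AUC = {value:.3f}")
```

An AUC closer to 1 means that the feature alone separates malignant from benign samples more cleanly, which allows the worst-case and mean versions of the same measurement to be compared side by side.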
The results indicate that breast cancer malignancy is more closely associated with structural distortion and localized boundary irregularities than with overall tumor size. A single highly abnormal region within a tumor can substantially increase malignancy likelihood, even when the remainder of the tumor appears relatively uniform.
Diagnostic evaluation should place strong emphasis on tumor boundary morphology, particularly:
Irregular or jagged margins
Spiculations and concave distortions
Poor symmetry
The analysis shows that shape irregularity is a stronger indicator of malignancy than overall tumor size, especially when abnormalities are localized.
Clinical assessments should prioritize the most abnormal region of a tumor, rather than relying solely on average measurements.
Worst-case features such as:
Maximum concave points
Worst-case radius and area
Localized asymmetry
provide greater diagnostic value than mean or global measurements.
Imaging techniques should be optimized to capture:
High-resolution tumor margins
Fine boundary irregularities
Localized distortions in shape
This supports early detection of malignancy that may not yet present as a uniformly large mass.
Risk stratification models and diagnostic checklists should explicitly incorporate shape-based indicators, including:
Boundary concavity
Asymmetry
Irregular growth patterns
These indicators may help differentiate aggressive tumors from benign growths earlier in the diagnostic process.
This analysis highlights that the most reliable indicators of breast cancer malignancy are localized tumor shape irregularities, rather than average tumor size or global measurements. The model consistently identified worst-case morphological features, particularly concavity, concave points, asymmetry, and extreme boundary distortion, as the strongest contributors to malignancy classification. These findings emphasize that a single highly abnormal region within a tumor can carry greater diagnostic significance than overall tumor appearance.
Based on these insights, diagnostic assessment should prioritize careful evaluation of tumor borders and focal irregularities, especially regions exhibiting spiculations, asymmetry, or sharp concave structures. Imaging and screening practices may benefit from enhanced attention to worst-case regions, even when tumors appear moderate in size or largely uniform.
The results reinforce established clinical principles while providing quantitative evidence to support shape-based diagnostic criteria. Interpretable machine learning models can therefore serve as effective decision-support tools, helping clinicians identify high-risk morphological patterns with greater confidence and consistency.