Models Implemented
KMeans Clustering
Objective: To identify and group states with similar tax characteristics across different categories (e.g., individual, business, property taxes) in an unsupervised manner. This allows us to explore hidden patterns in tax policy structures and detect natural groupings without needing labeled data.
Why chosen: Suitable for grouping states by tax features without labeled targets.
Assumptions: Clusters are spherical; Euclidean distance is meaningful; data should be standardized.
Hyperparameter tuning: Tried k = 2, 3, 4; selected k = 2 with highest silhouette score (0.078).
Challenges & solutions:
Mixed symbols → cleaned values
Missing data → mean imputation
KMeans memory-leak warning with MKL on Windows → set environment variable OMP_NUM_THREADS=1
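The k selection step above can be sketched as follows; this is a minimal illustration on synthetic data, with the random matrix standing in for the scaled state/tax table.

```python
# Minimal sketch of choosing k by silhouette score; X is a synthetic
# stand-in for the standardized state/tax feature matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(50, 4)))

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher means better-separated clusters

best_k = max(scores, key=scores.get)
```

Silhouette values range from -1 to 1, which is why a score near 0 (such as the 0.078 above) signals weak separation.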
Apriori Analysis
Objective: Extract frequently co-occurring combinations of tax characteristics (e.g., a high income tax rate paired with a high excise tax) from the tax data to discover meaningful association rules.
Why chosen: Finds frequent combinations of categorical tax features once numerical values are discretized.
Assumptions: Requires one-hot encoded transactions; relies on the Apriori property that every subset of a frequent itemset is itself frequent (so infrequent itemsets and all their supersets can be pruned).
Hyperparameter tuning: Tuned min_support = 0.1–0.15, min_confidence = 0.4, lift ≥ 0.7, and limited max_len = 2 for speed.
Challenges & solutions:
Numeric → High/Low categories
Long-format → Pivoted + one-hot
Long runtime → Narrowed parameters
Sparse patterns → Relaxed thresholds
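Because only pairwise rules were mined (max_len = 2), the support/confidence/lift computation can be sketched directly in pandas; the one-hot columns below are invented for illustration, and the thresholds mirror the tuned values above.

```python
# Hand-rolled pairwise rule mining: support, confidence, lift for
# ordered rules a -> b over illustrative one-hot tax columns.
from itertools import permutations

import pandas as pd

onehot = pd.DataFrame({
    "IncomeTax_High": [1, 1, 0, 1, 0],
    "ExciseTax_High": [1, 1, 0, 1, 1],
    "SalesTax_High":  [0, 1, 1, 0, 0],
}).astype(bool)

rules = []
for a, b in permutations(onehot.columns, 2):  # ordered rules a -> b
    support = (onehot[a] & onehot[b]).mean()
    if support < 0.1:                         # min_support, as tuned above
        continue
    confidence = support / onehot[a].mean()
    lift = confidence / onehot[b].mean()
    if confidence >= 0.4 and lift >= 0.7:     # min_confidence and lift cutoffs
        rules.append({"rule": f"{a} -> {b}", "support": support,
                      "confidence": confidence, "lift": lift})
```

A library such as mlxtend does the same computation for arbitrary itemset lengths; the manual version just makes the three metrics explicit.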
Regression (Linear & Ridge)
Objective: To predict the "State and Local Tax Burden" based on various tax-related features (e.g., income tax rate, sales tax rate, property tax per capita), using linear regression and regularized regression models (Ridge). The goal is to understand how well these tax indicators can explain the overall tax burden and to evaluate whether regularization improves prediction or model robustness.
Why chosen: Linear and Ridge regression fit well for predicting a continuous outcome (Tax Burden) from numeric tax features.
Assumptions: Linearity, independence, constant variance, normal residuals; Ridge adds multicollinearity control.
Hyperparameter tuning: RidgeCV tuned alpha ∈ [0.01, 0.1, 1, 10, 100]; best = 0.01.
Challenges & solutions:
Missing data → filled with column means
Small data → Ridge used to avoid overfitting
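The RidgeCV tuning described above can be sketched like this; the data is synthetic, standing in for the scaled tax-feature matrix and the tax-burden target.

```python
# Sketch of alpha selection with RidgeCV over the grid used above,
# compared against plain OLS; X and y are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                       # e.g. income, sales, property tax features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=40)
Xs = StandardScaler().fit_transform(X)

ridge = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100]).fit(Xs, y)
ols = LinearRegression().fit(Xs, y)
r2_ridge, r2_ols = ridge.score(Xs, y), ols.score(Xs, y)
```

RidgeCV performs efficient leave-one-out cross-validation over the alpha grid internally, which is useful precisely when the dataset is too small to hold out a separate validation split.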
Decision Tree Classifier
Objective: To classify each state-tax group as having a High or Low Tax Burden, based on features such as income tax rate, sales tax, and other fiscal indicators. The goal is to understand which tax characteristics are associated with higher overall tax pressure, using an interpretable model like a Decision Tree.
Why chosen: Simple, interpretable model for small datasets with numeric features. Used to classify High vs. Low Tax Burden.
Assumptions: Few; decision trees require neither linearity nor feature scaling.
Hyperparameter tuning: Manually searched parameters; best: max_depth=2, min_samples_split=2, criterion='gini'
Challenges & solutions:
Very small class (Low Tax) → no cross-validation
Small data → full-dataset evaluation
Overfitting risk → shallow tree (max_depth=2)
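The shallow-tree setup above can be sketched as follows; the labels are synthetic stand-ins for the High/Low Tax Burden target.

```python
# Minimal sketch of the tuned shallow decision tree; a depth cap of 2
# limits the tree to at most three splits, curbing overfitting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = (X[:, 0] > 0).astype(int)  # stand-in for the High/Low Tax Burden label

clf = DecisionTreeClassifier(max_depth=2, min_samples_split=2,
                             criterion="gini", random_state=0).fit(X, y)
train_acc = clf.score(X, y)  # full-dataset evaluation, as in the report
```

Note that evaluating on the training data, as here, will report optimistic accuracy; with so few samples per class there was no held-out alternative.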
Performance Evaluation
KMeans Clustering
Evaluation Metrics: Silhouette Score (used to compare clustering quality).
Results: See this page.
Apriori Analysis
Evaluation Metrics: Support, Confidence, Lift
Results: See this page.
Regression (Linear & Ridge)
Evaluation Metrics: MSE, R²; both models achieved R² = 1.0
Results: See this page.
Decision Tree Classifier
Evaluation Metrics: Accuracy, Precision, Recall, F1 = 1.00 (training data)
Results: See this page.
Conclusion
Best Model for Comparing Colorado and Utah
Regression (Linear & Ridge) performed best.
Accurately predicted Tax Burden using tax indicators.
Achieved R² = 1.0 on the available data, indicating strong linear relationships between the tax indicators and the burden measure.
Other Models
KMeans Clustering:
Offered exploratory grouping of tax structures.
Low silhouette scores indicated weak cluster separation.
Decision Tree Classification:
Perfect training accuracy, but likely overfit given the class imbalance and full-dataset evaluation.
Not reliable for generalization.
Apriori Analysis:
No strong rules found due to small, sparse data.
Limited to simple pattern exploration.
Summary
Regression best supported the comparison goal.
Others added insights but were constrained by data limitations.
Data Formatting for Models
KMeans Clustering
Data formatting: Pivoted tax items by state/group; applied standard scaling.
Before-and-after data transformation snapshots: See this page.
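The pivot-and-scale step can be illustrated as follows; the states, tax items, and values are examples, not the project's actual table.

```python
# Long-format tax records pivoted to one row per state, then scaled
# so each tax item has zero mean and unit variance (a KMeans prerequisite).
import pandas as pd
from sklearn.preprocessing import StandardScaler

long_df = pd.DataFrame({
    "State": ["CO", "CO", "UT", "UT"],
    "TaxItem": ["IncomeTax", "SalesTax", "IncomeTax", "SalesTax"],
    "Value": [4.4, 2.9, 4.85, 6.1],
})

wide = long_df.pivot(index="State", columns="TaxItem", values="Value")
scaled = StandardScaler().fit_transform(wide)  # zero mean, unit variance per column
```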
Apriori Analysis
Data formatting: Cleaned values, categorized, and one-hot encoded for Apriori
Before-and-after data transformation snapshots: See this page.
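The discretize-and-encode step feeding Apriori can be sketched like this; the median cutoff for High vs. Low is an assumed rule, not necessarily the project's exact choice, and the rates are illustrative.

```python
# Numeric tax rates binned into Low/High at the median, then one-hot
# encoded into the transaction format Apriori expects.
import pandas as pd

rates = pd.DataFrame({
    "State": ["CO", "UT", "CA", "TX"],
    "IncomeTax": [4.4, 4.85, 13.3, 0.0],
})

median = rates["IncomeTax"].median()
rates["IncomeTax_Level"] = pd.cut(
    rates["IncomeTax"],
    bins=[-float("inf"), median, float("inf")],
    labels=["Low", "High"],
)
onehot = pd.get_dummies(rates["IncomeTax_Level"], prefix="IncomeTax")
```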
Regression (Linear & Ridge)
Data formatting: Cleaned Value, pivoted to wide format, filled NAs, scaled features
Before-and-after data transformation snapshots: See this page.
Decision Tree Classifier
Data formatting: Cleaned %/$, pivoted, filled NaNs, binarized target
Before-and-after data transformation snapshots: See this page.
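The symbol cleaning and target binarization can be sketched as follows; the figures and the 10% cutoff for a "High" burden are illustrative only.

```python
# Strip % and $/, symbols from string columns, convert to float, and
# binarize the tax-burden target for the classifier.
import pandas as pd

raw = pd.DataFrame({
    "State": ["CO", "UT", "NY"],
    "TaxBurden": ["9.6%", "12.1%", "15.9%"],
    "PropertyTaxPerCapita": ["$2,017", "$1,209", "$3,118"],
})

raw["TaxBurden"] = raw["TaxBurden"].str.rstrip("%").astype(float)
raw["PropertyTaxPerCapita"] = (
    raw["PropertyTaxPerCapita"].str.replace(r"[$,]", "", regex=True).astype(float)
)
raw["HighBurden"] = (raw["TaxBurden"] > 10).astype(int)  # binarized target
```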