Models Implemented
KMeans Clustering
Objective: To identify and group states with similar tax characteristics across different categories (e.g., individual, business, property taxes) in an unsupervised manner. This allows us to explore hidden patterns in tax policy structures and detect natural groupings without needing labeled data.
Why chosen: Suitable for grouping states by tax features without labeled targets.
Assumptions: Clusters are spherical; Euclidean distance is meaningful; data should be standardized.
Hyperparameter tuning: Tried k = 2, 3, 4; selected k = 2 with highest silhouette score (0.078).
Challenges & solutions:
Mixed symbols → cleaned values
Missing data → mean imputation
KMeans memory-leak warning with MKL on Windows → set environment variable OMP_NUM_THREADS=1
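The k selection step above can be sketched as follows; this is a minimal illustration on synthetic data, with the random matrix standing in for the scaled state/tax table.

```python
# Minimal sketch of choosing k by silhouette score; X is a synthetic
# stand-in for the standardized state/tax feature matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(50, 4)))

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher means better-separated clusters

best_k = max(scores, key=scores.get)
```

Silhouette values range from -1 to 1, which is why a score near 0 (such as the 0.078 above) signals weak separation.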
Apriori Analysis
Objective: Extract frequently co-occurring combinations of tax characteristics (e.g., a high income tax rate paired with a high excise tax) from the tax data to discover meaningful association rules.
Why chosen: Finds frequent combinations of categorical tax features once numerical values are discretized.
Assumptions: Requires one-hot encoded transactions; relies on the Apriori property that every subset of a frequent itemset is itself frequent (so infrequent itemsets and all their supersets can be pruned).
Hyperparameter tuning: Tuned min_support = 0.1–0.15, min_confidence = 0.4, lift ≥ 0.7, and limited max_len = 2 for speed.
Challenges & solutions:
Numeric → High/Low categories
Long-format → Pivoted + one-hot
Long runtime → Narrowed parameters
Sparse patterns → Relaxed thresholds
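Because only pairwise rules were mined (max_len = 2), the support/confidence/lift computation can be sketched directly in pandas; the one-hot columns below are invented for illustration, and the thresholds mirror the tuned values above.

```python
# Hand-rolled pairwise rule mining: support, confidence, lift for
# ordered rules a -> b over illustrative one-hot tax columns.
from itertools import permutations

import pandas as pd

onehot = pd.DataFrame({
    "IncomeTax_High": [1, 1, 0, 1, 0],
    "ExciseTax_High": [1, 1, 0, 1, 1],
    "SalesTax_High":  [0, 1, 1, 0, 0],
}).astype(bool)

rules = []
for a, b in permutations(onehot.columns, 2):  # ordered rules a -> b
    support = (onehot[a] & onehot[b]).mean()
    if support < 0.1:                         # min_support, as tuned above
        continue
    confidence = support / onehot[a].mean()
    lift = confidence / onehot[b].mean()
    if confidence >= 0.4 and lift >= 0.7:     # min_confidence and lift cutoffs
        rules.append({"rule": f"{a} -> {b}", "support": support,
                      "confidence": confidence, "lift": lift})
```

A library such as mlxtend does the same computation for arbitrary itemset lengths; the manual version just makes the three metrics explicit.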
Regression (Linear & Ridge)
Objective: To predict the "State and Local Tax Burden" based on various tax-related features (e.g., income tax rate, sales tax rate, property tax per capita), using linear regression and regularized regression models (Ridge). The goal is to understand how well these tax indicators can explain the overall tax burden and to evaluate whether regularization improves prediction or model robustness.
Why chosen: Linear and Ridge regression fit well for predicting a continuous outcome (Tax Burden) from numeric tax features.
Assumptions: Linearity, independence, constant variance, normal residuals; Ridge adds multicollinearity control.
Hyperparameter tuning: RidgeCV tuned alpha ∈ [0.01, 0.1, 1, 10, 100]; best = 0.01.
Challenges & solutions:
Missing data → filled with column means
Small data → Ridge used to avoid overfitting
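The RidgeCV tuning described above can be sketched like this; the data is synthetic, standing in for the scaled tax-feature matrix and the tax-burden target.

```python
# Sketch of alpha selection with RidgeCV over the grid used above,
# compared against plain OLS; X and y are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                       # e.g. income, sales, property tax features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=40)
Xs = StandardScaler().fit_transform(X)

ridge = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100]).fit(Xs, y)
ols = LinearRegression().fit(Xs, y)
r2_ridge, r2_ols = ridge.score(Xs, y), ols.score(Xs, y)
```

RidgeCV performs efficient leave-one-out cross-validation over the alpha grid internally, which is useful precisely when the dataset is too small to hold out a separate validation split.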
Decision Tree Classifier
Objective: To classify each state-tax group as having a High or Low Tax Burden, based on features such as income tax rate, sales tax, and other fiscal indicators. The goal is to understand which tax characteristics are associated with higher overall tax pressure, using an interpretable model like a Decision Tree.
Why chosen: Simple, interpretable model for small datasets with numeric features. Used to classify High vs. Low Tax Burden.
Assumptions: Few; decision trees require neither linearity nor feature scaling.
Hyperparameter tuning: Manually searched parameters; best: max_depth=2, min_samples_split=2, criterion='gini'
Challenges & solutions:
Very small class (Low Tax) → no cross-validation
Small data → full-dataset evaluation
Overfitting risk → shallow tree (max_depth=2)
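The shallow-tree setup above can be sketched as follows; the labels are synthetic stand-ins for the High/Low Tax Burden target.

```python
# Minimal sketch of the tuned shallow decision tree; a depth cap of 2
# limits the tree to at most three splits, curbing overfitting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = (X[:, 0] > 0).astype(int)  # stand-in for the High/Low Tax Burden label

clf = DecisionTreeClassifier(max_depth=2, min_samples_split=2,
                             criterion="gini", random_state=0).fit(X, y)
train_acc = clf.score(X, y)  # full-dataset evaluation, as in the report
```

Note that evaluating on the training data, as here, will report optimistic accuracy; with so few samples per class there was no held-out alternative.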
Performance Evaluation
KMeans Clustering
Evaluation Metrics: Silhouette Score (used to compare clustering quality).
Results: See this page.
Apriori Analysis
Evaluation Metrics: Support, Confidence, Lift
Results: See this page.
Regression (Linear & Ridge)
Evaluation Metrics: MSE, R²; both models achieved R² = 1.0
Results: See this page.
Decision Tree Classifier
Evaluation Metrics: Accuracy, Precision, Recall, F1 = 1.00 (training data)
Results: See this page.
Conclusion
Best Model for Comparing Colorado and Utah
Regression (Linear & Ridge) performed best.
Accurately predicted Tax Burden using tax indicators.
Achieved R² = 1.0 on the available data, indicating strong linear relationships between the tax indicators and the burden measure.
Other Models
KMeans Clustering:
Offered exploratory grouping of tax structures.
Low silhouette scores indicated weak cluster separation.
Decision Tree Classification:
Perfect training accuracy, but likely overfit given the class imbalance and full-dataset evaluation.
Not reliable for generalization.
Apriori Analysis:
No strong rules found due to small, sparse data.
Limited to simple pattern exploration.
Summary
Regression best supported the comparison goal.
Others added insights but were constrained by data limitations.
Data Formatting for Models
KMeans Clustering
Data formatting: Pivoted tax items by state/group; applied standard scaling.
Before-and-after data transformation snapshots: See this page.
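The pivot-and-scale step can be illustrated as follows; the states, tax items, and values are examples, not the project's actual table.

```python
# Long-format tax records pivoted to one row per state, then scaled
# so each tax item has zero mean and unit variance (a KMeans prerequisite).
import pandas as pd
from sklearn.preprocessing import StandardScaler

long_df = pd.DataFrame({
    "State": ["CO", "CO", "UT", "UT"],
    "TaxItem": ["IncomeTax", "SalesTax", "IncomeTax", "SalesTax"],
    "Value": [4.4, 2.9, 4.85, 6.1],
})

wide = long_df.pivot(index="State", columns="TaxItem", values="Value")
scaled = StandardScaler().fit_transform(wide)  # zero mean, unit variance per column
```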
Apriori Analysis
Data formatting: Cleaned values, categorized, and one-hot encoded for Apriori
Before-and-after data transformation snapshots: See this page.
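The discretize-and-encode step feeding Apriori can be sketched like this; the median cutoff for High vs. Low is an assumed rule, not necessarily the project's exact choice, and the rates are illustrative.

```python
# Numeric tax rates binned into Low/High at the median, then one-hot
# encoded into the transaction format Apriori expects.
import pandas as pd

rates = pd.DataFrame({
    "State": ["CO", "UT", "CA", "TX"],
    "IncomeTax": [4.4, 4.85, 13.3, 0.0],
})

median = rates["IncomeTax"].median()
rates["IncomeTax_Level"] = pd.cut(
    rates["IncomeTax"],
    bins=[-float("inf"), median, float("inf")],
    labels=["Low", "High"],
)
onehot = pd.get_dummies(rates["IncomeTax_Level"], prefix="IncomeTax")
```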
Regression (Linear & Ridge)
Data formatting: Cleaned Value, pivoted to wide format, filled NAs, scaled features
Before-and-after data transformation snapshots: See this page.
Decision Tree Classifier
Data formatting: Cleaned %/$, pivoted, filled NaNs, binarized target
Before-and-after data transformation snapshots: See this page.
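The symbol cleaning and target binarization can be sketched as follows; the figures and the 10% cutoff for a "High" burden are illustrative only.

```python
# Strip % and $/, symbols from string columns, convert to float, and
# binarize the tax-burden target for the classifier.
import pandas as pd

raw = pd.DataFrame({
    "State": ["CO", "UT", "NY"],
    "TaxBurden": ["9.6%", "12.1%", "15.9%"],
    "PropertyTaxPerCapita": ["$2,017", "$1,209", "$3,118"],
})

raw["TaxBurden"] = raw["TaxBurden"].str.rstrip("%").astype(float)
raw["PropertyTaxPerCapita"] = (
    raw["PropertyTaxPerCapita"].str.replace(r"[$,]", "", regex=True).astype(float)
)
raw["HighBurden"] = (raw["TaxBurden"] > 10).astype(int)  # binarized target
```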