Interactive Decision Tree Builder – CART Algorithm from Scratch
Build classification trees on your own CSV data and watch the greedy recursive splitting process create interpretable decision rules. Upload a dataset, adjust hyperparameters, and visualize the resulting tree structure with node impurity color-coding and interactive path highlighting.
Technical implementation:
CART algorithm (Classification and Regression Trees) with greedy top-down induction.
Dual impurity measures: Gini impurity (1 - Σp²) and Shannon entropy (-Σp·log₂p).
Numeric feature splits: Evaluates midpoints between adjacent sorted unique values, using cumulative label counting so split finding is dominated by the O(n log n) sort (sketched in the code after this list).
Categorical feature splits: One-vs-rest binary partitioning for each category value.
Automatic type inference: A column is treated as numeric when at least 70% of its values parse as numbers; otherwise it is handled as categorical.
Standard stopping criteria: Max depth, min samples per split, min impurity decrease.
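Below is a minimal TypeScript sketch of the two impurity measures and the cumulative-count numeric split search described above. The function names (gini, entropy, bestNumericSplit) and the flat arrays of values/labels are illustrative assumptions, not the project's actual API.

```ts
type Label = string;

function classCounts(labels: Label[]): Map<Label, number> {
  const counts = new Map<Label, number>();
  for (const y of labels) counts.set(y, (counts.get(y) ?? 0) + 1);
  return counts;
}

// Gini impurity: 1 - Σ p²
function gini(labels: Label[]): number {
  let sumSq = 0;
  for (const c of classCounts(labels).values()) sumSq += (c / labels.length) ** 2;
  return 1 - sumSq;
}

// Shannon entropy: -Σ p·log₂ p
function entropy(labels: Label[]): number {
  let h = 0;
  for (const c of classCounts(labels).values()) {
    const p = c / labels.length;
    h -= p * Math.log2(p);
  }
  return h;
}

// Gini computed from running class counts, so candidate splits can be scored
// without re-partitioning the rows.
function giniFromCounts(counts: Map<Label, number>, total: number): number {
  let sumSq = 0;
  for (const c of counts.values()) sumSq += (c / total) ** 2;
  return 1 - sumSq;
}

// Exhaustive numeric split search: sort once (O(n log n)), then sweep the
// sorted order, shifting one sample at a time from the right counts to the
// left counts and scoring the midpoint between adjacent distinct values.
function bestNumericSplit(
  values: number[],
  labels: Label[]
): { threshold: number; gain: number } | null {
  const order = values.map((_, i) => i).sort((a, b) => values[a] - values[b]);
  const n = order.length;
  const parent = gini(labels);

  const left = new Map<Label, number>();
  const right = classCounts(labels);

  let best: { threshold: number; gain: number } | null = null;
  for (let pos = 0; pos < n - 1; pos++) {
    const i = order[pos];
    // Cumulative counting: move sample i from the right partition to the left.
    left.set(labels[i], (left.get(labels[i]) ?? 0) + 1);
    right.set(labels[i], (right.get(labels[i]) as number) - 1);

    const v = values[i];
    const vNext = values[order[pos + 1]];
    if (v === vNext) continue; // only split between distinct sorted values

    const nLeft = pos + 1;
    const nRight = n - nLeft;
    const weighted =
      (nLeft / n) * giniFromCounts(left, nLeft) +
      (nRight / n) * giniFromCounts(right, nRight);
    const gain = parent - weighted;

    const threshold = (v + vNext) / 2; // midpoint between adjacent unique values
    if (!best || gain > best.gain) best = { threshold, gain };
  }
  return best;
}
```

The key design point: moving one sorted sample at a time between the right-hand and left-hand counts lets each candidate midpoint be scored in O(k) for k classes, so the initial sort dominates the cost.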
Why the implementation matters: Most decision tree tutorials just call sklearn.DecisionTreeClassifier(). This shows you understand:
Information gain calculation: IG = I(parent) - Σ(n_child/n_parent)·I(child) (see the sketch after this list).
Why midpoint thresholds work for numeric splits.
How to handle missing values (ignored per split, not imputed).
The computational cost of exhaustive split search.
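For concreteness, here is one way the information-gain formula above can be written down; informationGain and the inline Gini helper are hypothetical names, repeated here so the snippet stands alone.

```ts
type ImpurityFn = (labels: string[]) => number;

// Gini impurity, repeated inline so this example is self-contained.
const gini: ImpurityFn = (labels) => {
  const counts = new Map<string, number>();
  for (const y of labels) counts.set(y, (counts.get(y) ?? 0) + 1);
  let sumSq = 0;
  for (const c of counts.values()) sumSq += (c / labels.length) ** 2;
  return 1 - sumSq;
};

// IG = I(parent) - Σ (n_child / n_parent) · I(child)
function informationGain(
  parentLabels: string[],
  children: string[][],
  impurity: ImpurityFn = gini
): number {
  const n = parentLabels.length;
  const weighted = children.reduce(
    (acc, child) => acc + (child.length / n) * impurity(child),
    0
  );
  return impurity(parentLabels) - weighted;
}

// A perfect binary split of ["a","a","b","b"] gives IG = 0.5 under Gini.
console.log(informationGain(["a", "a", "b", "b"], [["a", "a"], ["b", "b"]]));
```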
Interactive features:
D3.js tree layout with automatic node positioning (a minimal layout sketch follows this list).
Path highlighting: Click "Inspect a row" to see which splits a specific example traverses.
Node tooltips: Hover to see split rule, information gain, sample counts, class distribution.
Color-coded impurity: Green (pure) to red (mixed) gradient.
Export capabilities: SVG for presentations, JSON for model serialization.
Zoom/pan: Explore large trees interactively.
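A stripped-down sketch of how such a view can be built with d3.hierarchy and d3.tree; the TreeNode shape, the renderTree name, and the color-scale domain are assumptions about what the trainer produces, not the project's actual code.

```ts
import * as d3 from "d3";

// Assumed node shape produced by the training code.
interface TreeNode {
  rule: string;       // e.g. "petal_width <= 0.8" or "leaf: setosa"
  impurity: number;   // Gini or entropy at this node
  samples: number;
  children?: TreeNode[];
}

function renderTree(svgSelector: string, root: TreeNode): void {
  const width = 800;
  const height = 500;

  // d3.hierarchy walks the `children` arrays; d3.tree assigns x/y positions.
  const hierarchy = d3.hierarchy<TreeNode>(root);
  const positioned = d3.tree<TreeNode>().size([width, height - 40])(hierarchy);

  const svg = d3.select(svgSelector).attr("viewBox", `0 0 ${width} ${height}`);

  // Green (pure) → red (mixed) impurity gradient for node fill.
  const color = d3
    .scaleLinear<string>()
    .domain([0, 0.5])
    .range(["#2e7d32", "#c62828"])
    .clamp(true);

  // Straight edges between parent and child nodes.
  svg
    .selectAll("line.link")
    .data(positioned.links())
    .join("line")
    .attr("class", "link")
    .attr("stroke", "#999")
    .attr("x1", (d) => d.source.x)
    .attr("y1", (d) => d.source.y + 20)
    .attr("x2", (d) => d.target.x)
    .attr("y2", (d) => d.target.y + 20);

  // One circle per node, colored by impurity; tooltip and click handlers
  // would attach to this same selection.
  svg
    .selectAll("circle.node")
    .data(positioned.descendants())
    .join("circle")
    .attr("class", "node")
    .attr("cx", (d) => d.x)
    .attr("cy", (d) => d.y + 20)
    .attr("r", 8)
    .attr("fill", (d) => color(d.data.impurity));
}
```

Path highlighting can reuse the same hierarchy: a click handler can walk a node's ancestors() to recover the root-to-leaf path and restyle just those links.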
Dataset flexibility:
CSV upload or paste.
Automatic header detection (one possible heuristic is sketched after this list).
Handles mixed numeric/categorical features.
Works with multi-class targets.
Dataset preview table.
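One plausible way to implement the header detection and the 70% numeric threshold, assuming rows arrive from the CSV parser as arrays of strings; the project's actual heuristics may differ.

```ts
type ColumnType = "numeric" | "categorical";

function isNumericString(s: string): boolean {
  return s.trim() !== "" && !Number.isNaN(Number(s));
}

// Assumed heuristic: the first row is a header if none of its cells parse as numbers.
function looksLikeHeader(firstRow: string[]): boolean {
  return firstRow.every((cell) => !isNumericString(cell));
}

// A column is numeric when at least 70% of its non-empty values parse as numbers.
function inferColumnType(values: string[], threshold = 0.7): ColumnType {
  const nonEmpty = values.filter((v) => v.trim() !== "");
  if (nonEmpty.length === 0) return "categorical";
  const numericCount = nonEmpty.filter(isNumericString).length;
  return numericCount / nonEmpty.length >= threshold ? "numeric" : "categorical";
}

// Example: a mostly-numeric column with a stray "n/a" still counts as numeric.
console.log(inferColumnType(["3.2", "4.1", "n/a", "5.0", "2.8"])); // "numeric"
```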
Hyperparameter tuning: Adjust max depth, min samples split, and min impurity decrease to observe overfitting vs underfitting trade-offs. The visualization makes it immediately clear when a tree is too shallow (underfits) or too deep (creates single-sample leaves).
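Under the hood, those three knobs typically reduce to a single guard on the recursion, along the lines of the hypothetical shouldStop below.

```ts
// Illustrative pre-pruning guard; names and fields are assumptions.
interface TreeParams {
  maxDepth: number;            // stop growing below this depth
  minSamplesSplit: number;     // don't split nodes with fewer samples than this
  minImpurityDecrease: number; // require at least this much information gain
}

function shouldStop(
  depth: number,
  nSamples: number,
  bestGain: number,
  p: TreeParams
): boolean {
  return (
    depth >= p.maxDepth ||
    nSamples < p.minSamplesSplit ||
    bestGain < p.minImpurityDecrease
  );
}
```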
Educational value:
For explaining to stakeholders: "Here's exactly why the model classifies this customer as high-risk" (show the path).
For debugging: See which features are actually being used (top splits).
For feature engineering: Notice when a numeric feature gets split at unexpected thresholds.
Comparison to production trees: This implementation uses the same core algorithm as sklearn's DecisionTreeClassifier but with:
Explicit, readable cumulative counting for numeric splits (not Cython-optimized like sklearn's tree code).
One-vs-rest categorical handling (sklearn's trees make only binary splits and require categorical features to be numerically encoded first).
Simplified pruning (pre-pruning via hyperparameters only; no cost-complexity post-pruning).