ML Decision Trees (Gemini)
A Decision Tree is a supervised machine learning algorithm that can be used for both classification and regression tasks.
It's called a "tree" because it repeatedly breaks the data down into smaller and smaller subsets, and as it does so it incrementally builds up a corresponding tree of decisions.
The final result is a tree-like structure with nodes and branches.
Goal: Create a model that predicts the value of a target variable by learning decision rules inferred from the data features.
How it works
Start at the Root: The tree begins with a single node representing the complete dataset.
Splitting: The algorithm looks for the "best" way to split the data into two or more subsets based on a feature and a threshold (for numerical features) or a category (for categorical features).
"Best" typically means maximizing information gain or minimizing impurity (e.g., Gini impurity, entropy).
Recursive Process: The splitting process is repeated recursively for each new node until a stopping condition is met (e.g., maximum depth reached, minimum samples per leaf, no further information gain); see the code sketch after this list.
Leaf Nodes: The terminal nodes where no further splitting occurs are called "leaf nodes." Each leaf node represents a class prediction (for classification) or an average value (for regression).
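To make the recursive process concrete, here is a minimal sketch in Python of the greedy build loop described above, assuming numeric features, binary splits, and Gini impurity as the impurity measure; the names (build, best_split, max_depth) are illustrative, not taken from any particular library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Greedy search for the (feature, threshold) pair that most reduces
    size-weighted Gini impurity; returns None if nothing beats the parent."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build(rows, labels, depth=0, max_depth=3):
    """Recurse until a stopping condition: max depth, or no impurity-reducing split."""
    split = best_split(rows, labels)
    if split is None or depth >= max_depth:
        return Counter(labels).most_common(1)[0][0]  # leaf node: majority class
    f, t = split
    go_left = [r[f] <= t for r in rows]
    left_rows = [r for r, g in zip(rows, go_left) if g]
    left_labels = [y for y, g in zip(labels, go_left) if g]
    right_rows = [r for r, g in zip(rows, go_left) if not g]
    right_labels = [y for y, g in zip(labels, go_left) if not g]
    return {"feature": f, "threshold": t,
            "left": build(left_rows, left_labels, depth + 1, max_depth),
            "right": build(right_rows, right_labels, depth + 1, max_depth)}

# Tiny usage example: one numeric feature, two classes.
tree = build([[2.0], [3.5], [7.0], [8.5]], ["a", "a", "b", "b"])
print(tree)  # {'feature': 0, 'threshold': 3.5, 'left': 'a', 'right': 'b'}
```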
Key terminology
Splitting: The process of dividing a node into two or more sub-nodes.
Branch/Edge: A connection between nodes.
Parent/Child Node: A node that is split is the parent, and the resulting sub-nodes are children.
Impurity: A measure of how mixed the class labels are within a node; a pure node contains samples of only one class. The goal is to reduce impurity with each split. Common measures are Gini impurity and entropy.
Information Gain: The reduction in entropy or Gini impurity after a dataset is split on an attribute. The attribute with the highest information gain is chosen for splitting.
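To illustrate these two definitions, the snippet below computes entropy, Gini impurity, and the information gain of a candidate split directly from label lists; the labels here are made-up demonstration values, not data from the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["yes"] * 5 + ["no"] * 5                    # maximally mixed node
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4
print(entropy(parent))                               # 1.0
print(gini(parent))                                  # 0.5
print(information_gain(parent, [left, right]))       # ~0.278
```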
Advantages:
Easy to understand and interpret.
Requires little data preparation (e.g., no feature scaling or normalization needed).
Can handle both numerical and categorical data.
Relatively fast for prediction.
Disadvantages:
Prone to overfitting (especially deep trees).
Can be unstable (small changes in data can lead to a very different tree).
Might not be as accurate as more complex algorithms for some problems.
Intuition: a trained decision tree behaves like a series of nested "if-else" statements; the example below makes this explicit.
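As a sketch of that intuition, assuming scikit-learn is installed, the snippet below trains a shallow tree on the Iris dataset and prints it as nested if/else-style rules; max_depth=2 is chosen only to keep the printout readable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Prints nested rules of the form "|--- petal width (cm) <= ..." with
# one "|--- class: ..." line per leaf node.
print(export_text(clf, feature_names=list(iris.feature_names)))
```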