ML Decision Trees (Gemini)
A Decision Tree is a supervised machine learning algorithm that can be used for both classification and regression tasks.
It's called a "tree" because it repeatedly breaks the data down into smaller and smaller subsets, and as it does so it incrementally builds up a corresponding tree of decisions.
The final result is a tree-like structure with nodes and branches.
Goal: Create a model that predicts the value of a target variable by learning decision rules inferred from the data features.
How it works
Start at the Root: The tree begins with a single node representing the complete dataset.
Splitting: The algorithm looks for the "best" way to split the data into two or more subsets based on a feature and a threshold (for numerical features) or a category (for categorical features).
"Best" typically means maximizing information gain or minimizing impurity (e.g., Gini impurity, entropy).
Recursive Process: The splitting process is repeated recursively for each new node until a stopping condition is met (e.g., maximum depth reached, minimum samples per leaf, no further information gain); see the code sketch after this list.
Leaf Nodes: The terminal nodes where no further splitting occurs are called "leaf nodes." Each leaf node represents a class prediction (for classification) or an average value (for regression).
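To make the recursive process concrete, here is a minimal sketch in Python of the greedy build loop described above, assuming numeric features, binary splits, and Gini impurity as the impurity measure; the names (build, best_split, max_depth) are illustrative, not taken from any particular library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Greedy search for the (feature, threshold) pair that most reduces
    size-weighted Gini impurity; returns None if nothing beats the parent."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build(rows, labels, depth=0, max_depth=3):
    """Recurse until a stopping condition: max depth, or no impurity-reducing split."""
    split = best_split(rows, labels)
    if split is None or depth >= max_depth:
        return Counter(labels).most_common(1)[0][0]  # leaf node: majority class
    f, t = split
    go_left = [r[f] <= t for r in rows]
    left_rows = [r for r, g in zip(rows, go_left) if g]
    left_labels = [y for y, g in zip(labels, go_left) if g]
    right_rows = [r for r, g in zip(rows, go_left) if not g]
    right_labels = [y for y, g in zip(labels, go_left) if not g]
    return {"feature": f, "threshold": t,
            "left": build(left_rows, left_labels, depth + 1, max_depth),
            "right": build(right_rows, right_labels, depth + 1, max_depth)}

# Tiny usage example: one numeric feature, two classes.
tree = build([[2.0], [3.5], [7.0], [8.5]], ["a", "a", "b", "b"])
print(tree)  # {'feature': 0, 'threshold': 3.5, 'left': 'a', 'right': 'b'}
```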
Key terminology
Splitting: The process of dividing a node into two or more sub-nodes.
Branch/Edge: A connection between nodes.
Parent/Child Node: A node that is split is the parent, and the resulting sub-nodes are children.
Impurity: A measure of how mixed the class labels are within a node; a pure node contains samples of only one class. The goal is to reduce impurity with each split. Common measures are Gini impurity and entropy.
Information Gain: The reduction in entropy or Gini impurity after a dataset is split on an attribute. The attribute with the highest information gain is chosen for splitting.
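To illustrate these two definitions, the snippet below computes entropy, Gini impurity, and the information gain of a candidate split directly from label lists; the labels here are made-up demonstration values, not data from the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["yes"] * 5 + ["no"] * 5                    # maximally mixed node
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4
print(entropy(parent))                               # 1.0
print(gini(parent))                                  # 0.5
print(information_gain(parent, [left, right]))       # ~0.278
```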
Advantages:
Easy to understand and interpret.
Requires little data preparation (e.g., no feature scaling or normalization needed).
Can handle both numerical and categorical data.
Relatively fast for prediction.
Disadvantages:
Prone to overfitting (especially deep trees).
Can be unstable (small changes in data can lead to a very different tree).
Might not be as accurate as more complex algorithms for some problems.
Intuition: a trained decision tree behaves like a series of nested "if-else" statements; the example below makes this explicit.
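As a sketch of that intuition, assuming scikit-learn is installed, the snippet below trains a shallow tree on the Iris dataset and prints it as nested if/else-style rules; max_depth=2 is chosen only to keep the printout readable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Prints nested rules of the form "|--- petal width (cm) <= ..." with
# one "|--- class: ..." line per leaf node.
print(export_text(clf, feature_names=list(iris.feature_names)))
```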