Distances and Similarity
Also see: Using Distance Metrics
Distance and similarity measures are foundational in machine learning. They quantify how closely related two data points or embedding vectors are. These calculations form the backbone of algorithms such as K-Means clustering and K-Nearest Neighbors (KNN).
Common metrics include Euclidean distance (straight-line), Manhattan distance (sum of absolute differences), and Cosine similarity (angle-based). A smaller distance indicates higher similarity, whereas for a similarity score such as cosine a larger value indicates a closer match.
In unsupervised learning, these measures allow algorithms like K-Means to group similar data points into clusters, while in supervised learning, K-Nearest Neighbors (KNN) uses them to classify new data based on proximity to labeled examples.
Top Distance and Similarity Metrics
Different data types and algorithms require specific metrics to produce meaningful results:
Euclidean Distance: The most common straight-line distance between two points, used frequently when data is continuous and numerical.
Manhattan Distance: Also known as Taxicab or City Block distance, this sums the absolute differences along each coordinate. It suits grid-like layouts and is often used for high-dimensional data; because it does not square the differences, it is less sensitive to outliers than Euclidean distance.
Cosine Similarity: Measures the cosine of the angle between two vectors, focusing on orientation rather than magnitude. It is ideal for text mining and recommendation systems where document length or frequency differs.
Minkowski Distance: A generalized formula with an order parameter p that includes both Manhattan (p = 1) and Euclidean (p = 2) distances as special cases, allowing for flexibility in tuning algorithms.
Hamming Distance: Measures the number of positions at which the corresponding elements of two equal-length sequences differ, commonly used for categorical or binary data.
Jaccard Similarity: Measures the size of the intersection divided by the size of the union of two sets, used for measuring similarity in set-based or binary data.
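The metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names are chosen here for readability, inputs are assumed to be equal-length sequences (or sets, for Jaccard), and edge cases such as zero vectors or empty sets are not handled.

```python
import math

def euclidean(x, y):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # Generalization: p = 1 gives Manhattan, p = 2 gives Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_similarity(x, y):
    # Cosine of the angle between x and y (undefined for zero vectors).
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def hamming(x, y):
    # Number of positions where the two sequences differ.
    return sum(a != b for a, b in zip(x, y))

def jaccard_similarity(s, t):
    # Size of the intersection over size of the union (undefined if both are empty).
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

x, y = [0.0, 0.0], [3.0, 4.0]
print(euclidean(x, y))                              # 5.0
print(manhattan(x, y))                              # 7.0
print(minkowski(x, y, 2))                           # 5.0 up to floating-point rounding
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))    # 1.0
print(hamming("karolin", "kathrin"))                # 3
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))     # 0.5
```

Note that cosine similarity ignores the difference in length between [1, 0] and [2, 0], while the distance functions do not.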
Importance of Distances in ML Algorithms
1. K-Means Clustering: Relies on minimizing the distance between data points and their respective cluster centroids (usually using Euclidean distance) to form distinct groups.
2. K-Nearest Neighbors (KNN): Determines the label of a new data point by finding the closest labeled points in the feature space, making it sensitive to the chosen metric.
3. Dimensionality Reduction: Techniques such as t-SNE or UMAP rely on maintaining distance relationships between data points to visualize high-dimensional data in lower dimensions.
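To illustrate how the chosen metric can change a KNN prediction, here is a minimal sketch of a nearest-neighbor classifier with a pluggable distance function. The helper name knn_predict and the toy training points are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def knn_predict(train, query, k, metric):
    # train is a list of (point, label) pairs.
    # Sort by distance to the query, keep the k closest, take a majority vote.
    nearest = sorted(train, key=lambda item: metric(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two labeled points chosen so that the two metrics disagree.
train = [([3.0, 3.0], "a"), ([0.0, 5.0], "b")]
query = [0.0, 0.0]

print(knn_predict(train, query, k=1, metric=euclidean))  # a
print(knn_predict(train, query, k=1, metric=manhattan))  # b
```

The point (3, 3) is closer in Euclidean terms (about 4.24 versus 5), but farther in Manhattan terms (6 versus 5), so the predicted label flips with the metric.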
Crucial Considerations
Feature Scaling: Because distance metrics are sensitive to the scale of data (e.g., age vs. income), it is essential to normalize or standardize data (e.g., using MinMaxScaler or StandardScaler) before computing distances.
Data Type: Using the wrong metric (e.g., Euclidean for categorical data) can lead to poor model performance.
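The effect of feature scaling can be seen on a small made-up table of (age, income) rows. The sketch below assumes z-score standardization with the population standard deviation, similar in spirit to StandardScaler; the data values and the helper names standardize and nearest_index are hypothetical.

```python
import math
from statistics import mean, pstdev

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def standardize(rows):
    # Z-score each column: subtract the column mean, divide by its
    # population standard deviation (assumes no column is constant).
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def nearest_index(rows, query_index):
    # Index of the row closest to rows[query_index], excluding itself.
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: euclidean(rows[i], query))

# Columns: age in years, income in currency units (illustrative values).
people = [[25, 50000], [60, 50500], [26, 52000], [40, 90000], [35, 30000]]

print(nearest_index(people, 0))               # 1
print(nearest_index(standardize(people), 0))  # 2
```

On the raw data, the 25-year-old is matched to the 60-year-old because income differences dominate the distance. After standardization, the 26-year-old with a similar income becomes the nearest neighbor.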
Choosing the Right Metric
Continuous numerical data: Euclidean is often preferred.
High-dimensional/sparse data: Cosine is usually better.
Categorical/binary data: Hamming is appropriate.
Note: The Curse of Dimensionality
As the number of dimensions increases, the behavior of space changes in ways that are highly counter-intuitive. This is often called the Curse of Dimensionality.
In high-dimensional spaces, distance metrics can become less effective because the distance between any two points tends to converge, making traditional measures less meaningful.
Data Sparsity: In high dimensions, data becomes incredibly sparse. Ten points that cover a one-dimensional interval reasonably well are, by the time you reach 100 dimensions, essentially lost in a vast vacuum.
Key takeaway: To maintain the same density of data as you add dimensions, the amount of data needed grows exponentially.
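One way to make the exponential growth concrete is to assume a regular grid with a fixed number of sample points along each axis. The grid is an illustrative assumption, and points_needed is a hypothetical helper name.

```python
def points_needed(points_per_axis, dimensions):
    # A grid with the same resolution along every axis needs
    # points_per_axis ** dimensions points in total.
    return points_per_axis ** dimensions

for d in (1, 2, 3, 10):
    print(d, points_needed(10, d))
# 1 10
# 2 100
# 3 1000
# 10 10000000000
```

At 100 dimensions the same resolution would require 10 to the power of 100 points, far more than any real dataset contains.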
The Death of Distance
In 2D or 3D, some points are close and others are far. However, in very high-dimensional space, the distance between any two random points starts to become almost exactly the same.
Because every point is roughly the same distance from every other point, concepts like nearest neighbors (which many ML algorithms rely on) become much less effective. Standard Euclidean distance ($d = \sqrt{\sum_i (x_i - y_i)^2}$) becomes a poor way to tell near points from far ones.
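This concentration of distances can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and uses the relative contrast (max - min) / min of the distances from one query point as a rough indicator; the function name, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random

def relative_contrast(dimensions, n_points=200, seed=0):
    # Draw one query and n_points other points uniformly from [0, 1]^d,
    # then compare the farthest and nearest distances to the query.
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point))))
    return (max(distances) - min(distances)) / min(distances)

for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

In low dimensions the farthest point is many times farther away than the nearest one. As the dimension grows the contrast shrinks toward zero, which is what makes "nearest" less informative.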
High Dimensions in Modern AI
While high dimensions are a headache for traditional statistics, they are the playground for modern AI:
Image Recognition: A small $224 \times 224$ pixel color image lives in a space with 150,528 dimensions (each pixel's Red, Green, and Blue values).
Word Embeddings (LLMs): When an AI processes a word, it represents that word as a vector in a space typically ranging from 768 to 4096 dimensions. This allows the model to capture subtle nuances: loosely speaking, one direction might relate to gender, another to royalty, another to verb tense.
How We Handle It
Since humans can't visualize 100 dimensions, we use Dimensionality Reduction. Techniques like PCA (Principal Component Analysis) or UMAP act like a camera, taking a shadow or projection of the high-dimensional data and squashing it down into 2D or 3D so we can see the clusters and patterns.
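The projection idea can be sketched with a deliberately simplified PCA. The code below finds only the first principal direction, using power iteration (a technique swapped in here to keep the example dependency-free), and projects the data onto it. The toy data and the helper names center, top_component, and project are hypothetical; a real workflow would normally use a library implementation and keep two or three components.

```python
import math
import random

def center(rows):
    # Subtract the column means so the data is centered at the origin.
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def top_component(rows, iterations=200, seed=0):
    # Power iteration on X^T X: repeatedly apply the matrix and renormalize.
    # The vector converges to the direction of greatest variance.
    rng = random.Random(seed)
    dims = len(rows[0])
    v = [rng.random() + 0.1 for _ in range(dims)]
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in rows]            # X v
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dims)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def project(rows, direction):
    # One coordinate per point: its position along the given direction.
    return [sum(a * b for a, b in zip(row, direction)) for row in rows]

# Toy 3-D data that varies mostly along the direction (1, 1, 0).
rng = random.Random(1)
data = []
for _ in range(50):
    t = rng.uniform(-5, 5)
    data.append([t + rng.gauss(0, 0.1), t + rng.gauss(0, 0.1), rng.gauss(0, 0.1)])

centered = center(data)
pc1 = top_component(centered)
coords_1d = project(centered, pc1)
print([round(c, 2) for c in pc1])
```

The recovered direction should be close to (0.71, 0.71, 0), so the single projected coordinate retains most of the variation in the three original coordinates.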
Low Dimensions: Points are distributed like clusters in a room; some are close, some are far.
High Dimensions: Every point starts to look like it is the same distance from every other point. This makes algorithms like K-Nearest Neighbors (KNN) or K-Means Clustering struggle because closeness loses its meaning.
Volume and the Empty Space Phenomenon
As dimensionality increases, the volume of the space grows exponentially, and most of that volume ends up near the boundary, so distances behave in counter-intuitive ways.
The Crust Effect: In high dimensions, almost all the volume of a sphere is located in a thin shell near its surface. If you pick two random points in a high-dimensional cube, they are almost guaranteed to be near the edge and far from each other.
Distance to the Origin: Most points in high-dimensional space end up being roughly the same distance from the center. Even when points are drawn uniformly, very few land near the center; most of them sit far out, toward the corners of the hypercube.
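Both effects can be checked numerically. Since the volume of a ball scales with the radius raised to the power d, the fraction of a unit ball lying within a shell of a given thickness is 1 - (1 - thickness)^d. The sketch below computes that fraction and also samples uniform points from the cube [-1, 1]^d to measure their distance to the center. The function names, shell thickness, and sample sizes are illustrative assumptions.

```python
import math
import random

def shell_fraction(dimensions, thickness=0.01):
    # Fraction of a unit ball's volume within `thickness` of its surface.
    return 1 - (1 - thickness) ** dimensions

def center_distance_stats(dimensions, n_points=500, seed=0):
    # Mean and standard deviation of the distance to the center
    # for points drawn uniformly from the cube [-1, 1]^d.
    rng = random.Random(seed)
    dists = [
        math.sqrt(sum(rng.uniform(-1, 1) ** 2 for _ in range(dimensions)))
        for _ in range(n_points)
    ]
    m = sum(dists) / n_points
    sd = math.sqrt(sum((x - m) ** 2 for x in dists) / n_points)
    return m, sd

for d in (3, 100, 1000):
    print(d, round(shell_fraction(d), 5))

for d in (2, 100):
    m, sd = center_distance_stats(d)
    print(d, round(m, 2), round(sd / m, 3))
```

In 3 dimensions the outer 1% shell holds about 3% of the volume; in 1000 dimensions it holds nearly all of it. In the sampled cube, the spread of distances relative to their mean shrinks as the dimension grows, which is the "same distance from the center" effect.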
The choice of distance metric becomes a critical factor as dimensionality rises.
In real-world ML workflows, the right distance measure is crucial; for example, cosine similarity often works better than Euclidean distance when document lengths differ, as it compares relative word frequencies (the direction of the vector) rather than the sheer size of the document.
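The document-length example can be reproduced with simple word-count vectors. This is a minimal sketch: the three toy documents, the whitespace tokenization, and the helper name term_counts are assumptions made for illustration.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def term_counts(text, vocabulary):
    # Count how often each vocabulary word occurs in the text.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

short_doc = "cats chase mice"
long_doc = " ".join([short_doc] * 10)   # same wording, ten times longer
other_doc = "dogs chase cars"

vocabulary = sorted(set((short_doc + " " + other_doc).split()))
a, b, c = (term_counts(doc, vocabulary) for doc in (short_doc, long_doc, other_doc))

print("euclidean short vs long: ", round(euclidean(a, b), 2))
print("euclidean short vs other:", round(euclidean(a, c), 2))
print("cosine short vs long:    ", round(cosine_similarity(a, b), 2))
print("cosine short vs other:   ", round(cosine_similarity(a, c), 2))
```

Euclidean distance places the short document closer to the unrelated one than to its own longer copy, because the counts differ in magnitude. Cosine similarity treats the short and long versions as essentially identical, since their word proportions match.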