Cosine Similarity (Gemini)
Cosine similarity is a metric used to measure how similar two non-zero vectors are by calculating the cosine of the angle between them.
It focuses on the direction of the vectors, not their magnitude. This makes it well suited to high-dimensional data, such as the vectors used in text analysis, recommendation systems, and AI embeddings.
More Applications: It is commonly used in Natural Language Processing (NLP) to compare document similarity and search query relevance, and in machine learning to determine the similarity between data objects.
In many scenarios, such as comparing document similarity based on word counts, the length of the document (magnitude) does not matter as much as the content (direction).
For example, a short document and a long document about the same topic might have very different magnitudes but similar directions, leading to a high cosine similarity score.
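The short-versus-long document case can be sketched in a few lines. This is a toy illustration using only the Python standard library; the two documents and the helper name cosine_similarity are made up for the example, and the long document is simply the short one repeated 50 times.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy documents (made up for illustration): same topic, very different lengths.
short_doc = "cats chase mice"
long_doc = " ".join(["cats chase mice"] * 50)

short_counts = Counter(short_doc.split())
long_counts = Counter(long_doc.split())

# Shared vocabulary so both vectors use the same dimensions in the same order.
vocab = sorted(set(short_counts) | set(long_counts))
short_vec = [short_counts[word] for word in vocab]
long_vec = [long_counts[word] for word in vocab]

short_magnitude = math.sqrt(sum(x * x for x in short_vec))
long_magnitude = math.sqrt(sum(x * x for x in long_vec))
score = cosine_similarity(short_vec, long_vec)

print(short_magnitude, long_magnitude)  # the long document's magnitude is far larger
print(score)  # approximately 1.0, because the word proportions are identical
```

Real documents on the same topic will not have identical proportions, so the score will usually be high rather than exactly 1.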
Whereas Euclidean distance measures the straight-line distance between two points, cosine similarity measures the angle between two vectors.
Cosine similarity is particularly useful when the absolute magnitude of the vectors is less important than the relative proportions of their components.
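To make the contrast concrete, here is a minimal sketch that computes both measures for two vectors with the same proportions but different magnitudes. The vectors and the function names are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]  # same proportions as a, ten times the magnitude

distance = euclidean_distance(a, b)
similarity = cosine_similarity(a, b)

print(distance)    # large: the points are far apart
print(similarity)  # approximately 1.0: the directions coincide
```

Euclidean distance treats these two vectors as far apart, while cosine similarity treats them as pointing the same way.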
The result of calculating the cosine of the angle between two vectors is a single scalar value between -1 and 1.
Whether the vectors are the same, similar, or not similar depends directly on that value.
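For reference, the standard definition can be written out as follows. The symbols A, B, θ, and n are notation assumed here (two n-dimensional vectors and the angle between them); they do not appear elsewhere in these notes.

```latex
\[
\cos\theta
  = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```

The numerator is the dot product and the denominator is the product of the two magnitudes, which is why both vectors must be non-zero.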
Interpreting the Result
Same (Identical Direction): The result is 1. The vectors point in the exact same direction.
Highly Similar: The result is close to 1 (e.g., 0.85 or 0.99). The angle between them is very small.
Not Similar (Unrelated): The result is 0. The vectors are orthogonal (at a 90-degree angle), sharing no common direction.
Opposite (Inversely Similar): The result is -1. The vectors point in diametrically opposite directions.
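The four cases above can be reproduced with small two-dimensional vectors. This is a sketch with hand-picked vectors; the variable names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


reference = [1.0, 0.0]

same_direction = cosine_similarity(reference, [3.0, 0.0])   # same direction, longer
highly_similar = cosine_similarity(reference, [1.0, 0.1])   # small angle
orthogonal = cosine_similarity(reference, [0.0, 2.0])       # 90-degree angle
opposite = cosine_similarity(reference, [-4.0, 0.0])        # opposite direction

print(same_direction)  # 1
print(highly_similar)  # about 0.995
print(orthogonal)      # 0
print(opposite)        # -1
```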
Prepare a dataset for similarity matching
Vectors usually require preprocessing to ensure the calculation is mathematically valid and semantically meaningful.
Vectorization (For Non-Numeric Data)
Raw data must be converted into numerical vectors before any math can occur.
Categorical data: Converted via one-hot encoding or multi-hot encoding.
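As a rough illustration of the two encodings, the sketch below turns category labels into 0/1 vectors. The helper names one_hot and multi_hot and the genre list are hypothetical, not taken from any particular library.

```python
def one_hot(value, categories):
    """Vector with a single 1 at the position of the given category."""
    return [1 if value == category else 0 for category in categories]


def multi_hot(values, categories):
    """Vector with a 1 at the position of every category that is present."""
    present = set(values)
    return [1 if category in present else 0 for category in categories]


# Hypothetical, fixed category order; every vector must use the same order.
genres = ["action", "comedy", "drama", "horror"]

single_genre = one_hot("comedy", genres)
several_genres = multi_hot(["action", "drama"], genres)

print(single_genre)    # [0, 1, 0, 0]
print(several_genres)  # [1, 0, 1, 0]
```

One-hot encoding fits items with exactly one category, and multi-hot encoding fits items that can belong to several at once.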
Dimensionality Alignment
Both vectors must have the exact same number of dimensions (length).
You cannot calculate the cosine similarity between a 3-dimensional vector and a 5-dimensional vector.
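If one vector lacks dimensions that the other has (for example, words that appear in only one of two documents), map both vectors onto the same set of dimensions and fill the missing entries with zeros.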
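One way to do this alignment, assuming each vector is stored as a mapping from a feature name to its count, is sketched below. The function name align and the sample word counts are made up for the example. Zero-filling is only meaningful when both vectors describe the same kind of features.

```python
def align(counts_a, counts_b):
    """Map two sparse count dictionaries onto one shared, ordered set of keys."""
    keys = sorted(set(counts_a) | set(counts_b))
    vec_a = [counts_a.get(key, 0) for key in keys]  # 0 where a feature is absent
    vec_b = [counts_b.get(key, 0) for key in keys]
    return keys, vec_a, vec_b


# Hypothetical word counts: 3 distinct words versus 5 distinct words.
doc_a = {"cat": 2, "dog": 1, "fish": 1}
doc_b = {"cat": 1, "dog": 3, "bird": 2, "horse": 1, "fish": 1}

keys, vec_a, vec_b = align(doc_a, doc_b)

print(keys)   # ['bird', 'cat', 'dog', 'fish', 'horse']
print(vec_a)  # [0, 2, 1, 1, 0]
print(vec_b)  # [2, 1, 3, 1, 1]
```

After alignment both vectors have five dimensions in the same order, so the similarity can be computed.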
Handling Missing Values (NaNs)
Any missing, null, or NaN values within the vectors must be resolved before the calculation, in one of two ways:
Drop the dimensions containing nulls across both vectors.
Impute the missing values (e.g., replacing NaNs with 0 or the dataset mean).
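Both options can be sketched with plain Python lists. The helper names drop_nan_dimensions and impute_nan are hypothetical, and the fill value of 0.0 is just one possible choice; a dataset mean could be passed instead.

```python
import math


def drop_nan_dimensions(a, b):
    """Keep only the positions where neither vector has a NaN."""
    kept = [(x, y) for x, y in zip(a, b) if not (math.isnan(x) or math.isnan(y))]
    return [x for x, _ in kept], [y for _, y in kept]


def impute_nan(vector, fill=0.0):
    """Replace every NaN in the vector with the given fill value."""
    return [fill if math.isnan(x) else x for x in vector]


a = [1.0, float("nan"), 3.0]
b = [2.0, 5.0, float("nan")]

dropped_a, dropped_b = drop_nan_dimensions(a, b)
imputed_a = impute_nan(a)
imputed_b = impute_nan(b, fill=3.5)  # e.g. a mean computed elsewhere (assumed value)

print(dropped_a, dropped_b)  # [1.0] [2.0]
print(imputed_a)             # [1.0, 0.0, 3.0]
print(imputed_b)             # [2.0, 5.0, 3.5]
```

Dropping discards information from both vectors, while imputing keeps the dimension count but introduces assumed values, so the right choice depends on the data.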
Zero-Vector Check
Ensure neither vector consists entirely of zeros, such as [0, 0, 0].
The magnitude of a zero vector is 0.
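Dividing by that magnitude in the formula means dividing by zero, so the cosine similarity is undefined. Depending on the library, this surfaces as a division-by-zero error or as a NaN result.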
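A guarded version of the calculation might look like the sketch below. Returning None for a zero vector is a design choice made for this example, not a standard convention; raising an exception or returning 0 are other options some code uses.

```python
import math


def safe_cosine_similarity(a, b):
    """Cosine similarity, or None when it is undefined for a zero vector."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return None  # zero vector: no direction, so no angle to measure
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


undefined = safe_cosine_similarity([0, 0, 0], [1, 2, 3])
defined = safe_cosine_similarity([1, 2, 3], [2, 4, 6])

print(undefined)  # None
print(defined)    # approximately 1.0
```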
What is NOT Required
Length Normalization: You do not need to normalize the vectors to a length of 1 beforehand.
The cosine similarity formula automatically divides by the magnitudes, performing this normalization during the calculation.
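This can be checked numerically: computing the similarity on the raw vectors and on copies scaled to length 1 should give the same value, up to floating-point rounding. The sketch below uses arbitrary example vectors and a hypothetical normalize helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(vector):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


a = [3.0, 4.0]
b = [1.0, 7.0]

raw_score = cosine_similarity(a, b)
normalized_score = cosine_similarity(normalize(a), normalize(b))

# For unit vectors the denominator is 1, so the plain dot product is enough.
unit_dot = sum(x * y for x, y in zip(normalize(a), normalize(b)))

print(raw_score, normalized_score, unit_dot)  # three nearly identical values
```

Pre-normalizing is therefore optional for correctness, although storing unit vectors lets later comparisons use the dot product alone.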