Tokenization, Vectorization and Embedding (Gemini)
Tokenization, vectorization, and embedding are sequential, interconnected steps.
These steps bridge the gap between human language and machine learning computation. Tokenization handles syntax (the surface form of the text), and vectorization and embedding handle semantics (its meaning).
Tokenization breaks raw text into smaller subunits, such as words, subwords, or characters. Each unit is assigned a unique integer ID from a vocabulary.
Vector embeddings are mathematical representations of data such as words, sentences, or images where the meaning of the data is translated into an array of floating-point numbers.
Vector embeddings are the foundation of modern AI. By mapping information into a high-dimensional space, models can group similar concepts together.
Raw data (text, images, or audio) is converted to vectors through tokenization and embedding, then stored in specialized vector databases or multidimensional arrays for fast retrieval.
Semantic Proximity: Items with similar meanings or characteristics have vectors that are numerically closer together.
These spaces can have hundreds or thousands of dimensions (features), where each number captures some abstract feature of the data.
Example: arrange concepts on a 2D graph based on two features, sweetness on the x-axis and size on the y-axis. In this space, Apple might be assigned the coordinates (8, 5), Orange (7, 6), and Car (1, 20).
Because the coordinates of Apple and Orange are numerically close, the model treats an apple and an orange as more closely related to one another than either is to a car.
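The apple, orange, and car example can be checked with a few lines of arithmetic. The sketch below uses the toy coordinates from the example and plain Euclidean distance; the helper name `euclidean` is an illustrative choice, and real systems often use cosine similarity instead.

```python
import math

# Toy 2D coordinates from the example: (sweetness, size).
points = {
    "apple": (8, 5),
    "orange": (7, 6),
    "car": (1, 20),
}

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

apple_orange = euclidean(points["apple"], points["orange"])
apple_car = euclidean(points["apple"], points["car"])

print(f"apple-orange: {apple_orange:.2f}")  # about 1.41
print(f"apple-car:    {apple_car:.2f}")     # about 16.55
```

The smaller distance between apple and orange is what "semantically closer" means numerically in this toy space.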
Tokenization is about deconstruction: language is broken into units. Raw text is a continuous string of characters. Tokenization chops the text into smaller pieces called tokens (which can be words, subwords, or characters).
These tokens are mapped to arbitrary integer IDs in a static vocabulary list (e.g., the word apple might be assigned Token ID 4501).
At this stage, the model knows the text is broken down, but it does not understand what the tokens actually mean or how they relate to one another.
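As a rough illustration of this step, the sketch below splits a sentence into word-level tokens and maps them to integer IDs. The vocabulary, the `<unk>` fallback token, and the function names are all hypothetical; production tokenizers typically learn a subword vocabulary rather than splitting on words.

```python
import re

# Hypothetical toy vocabulary; real tokenizers learn far larger subword vocabularies.
vocab = {"<unk>": 0, "the": 1, "apple": 2, "is": 3, "sweet": 4, ".": 5}

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def encode(tokens):
    """Map each token to its integer ID, falling back to <unk> for unknown tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

tokens = tokenize("The apple is sweet.")
ids = encode(tokens)

print(tokens)  # ['the', 'apple', 'is', 'sweet', '.']
print(ids)     # [1, 2, 3, 4, 5]
```

The IDs carry no meaning by themselves: 2 is not "closer" to 3 than to 5 in any semantic sense.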
Floating-point Numbers (Embeddings): The model cannot do math with abstract identifiers. The integer token IDs are fed into an embedding lookup table.
This replaces the integer with a vector, a long sequence of floating-point numbers that represents the meaning of that token in a multidimensional space; later layers refine it with context from the surrounding tokens.
Floating-point calculations: Every subsequent layer in the neural network (like attention mechanisms) operates exclusively on floating-point numbers.
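The lookup itself is simple indexing. The sketch below assumes a tiny vocabulary of 6 tokens and 4-dimensional embeddings, with random placeholder values standing in for weights that a real model would learn during training; the names `embedding_table` and `embed` are illustrative.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

VOCAB_SIZE = 6  # hypothetical tiny vocabulary
EMBED_DIM = 4   # real models use hundreds or thousands of dimensions

# Embedding table: one row of floats per token ID. In a real model these
# values are learned during training; here they are random placeholders.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Replace each integer token ID with its row of the embedding table."""
    return [embedding_table[i] for i in token_ids]

token_ids = [1, 2, 3]
vectors = embed(token_ids)
for tid, vec in zip(token_ids, vectors):
    print(tid, [round(x, 3) for x in vec])
```

Each token ID selects one row, so the cost of the lookup does not depend on how long or complex the original word was.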
Why are the tokenization and embedding steps not combined? Separating these steps keeps the model flexible and manageable:
Vocabulary Control: If the embedding layer had to deal with raw words directly, the vocabulary would be effectively unbounded and unmanageable.
By tokenizing first, the model limits its vocabulary to a fixed set of subwords.
Reusability: Tokenization acts as a fixed indexing dictionary, whereas embeddings are continuously learned and adjusted during model training.
Hardware Efficiency: It is computationally cheaper to retrieve a vector from a pre-calculated embedding matrix using a simple Token ID than it is to process raw text directly into vectors.
How Vectors Are Stored
Vector Databases: These are specialized databases designed to store, manage, and query high-dimensional vectors.
Instead of exact keyword matches, they use similarity searches (like cosine similarity) to find data that is mathematically close.
Popular vector databases include Pinecone, Milvus, and Qdrant.
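To make similarity search concrete, here is a minimal brute-force sketch using cosine similarity over an in-memory dictionary. The stored vectors, the query, and the `search` function are hypothetical; this is not the API of Pinecone, Milvus, or Qdrant, which expose their own clients.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings standing in for a vector store.
store = {
    "apple": [0.9, 0.8, 0.1],
    "orange": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def search(query, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

results = search([1.0, 0.8, 0.0])
print(results)  # ['apple', 'orange']
```

This version compares the query against every stored vector, which is fine for a handful of items but is exactly the cost that index structures are meant to avoid at scale.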
Dimensionality Reduction: To save storage space and increase processing speed, techniques like Principal Component Analysis (PCA) or t-SNE are sometimes used to reduce the number of dimensions in the vector while retaining its core meaning.
Indexing Structures: To prevent the database from having to compare a query vector against every single stored vector (which is too slow for large datasets), databases use indexing structures like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index).
These algorithms group similar vectors together so the system can quickly zoom in on the right area.
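The idea behind an IVF-style index can be sketched in a few lines: assign each vector to its nearest centroid, then search only the bucket whose centroid is nearest to the query. The centroids below are hand-picked for illustration (a real index would typically learn them, for example with k-means, and often probes several buckets), and all names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hand-picked centroids for illustration; a real index would learn these.
centroids = [(8.0, 5.0), (1.0, 20.0)]

vectors = {
    "apple": (8.0, 5.0),
    "orange": (7.0, 6.0),
    "pear": (7.5, 5.5),
    "car": (1.0, 20.0),
    "truck": (0.5, 25.0),
}

def nearest_centroid(vec):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))

# Build the inverted lists: centroid index -> names of vectors assigned to it.
inverted_lists = {i: [] for i in range(len(centroids))}
for name, vec in vectors.items():
    inverted_lists[nearest_centroid(vec)].append(name)

def search(query):
    """Scan only the bucket whose centroid is nearest to the query."""
    bucket = inverted_lists[nearest_centroid(query)]
    return min(bucket, key=lambda name: dist(query, vectors[name]))

print(inverted_lists)       # {0: ['apple', 'orange', 'pear'], 1: ['car', 'truck']}
print(search((7.2, 6.1)))   # orange
```

The query is compared against two centroids and three bucket members instead of all five vectors. The trade-off is that a true nearest neighbor sitting in a different bucket would be missed, which is why such indexes are described as approximate.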
Summary
Step          | Function                       | Input          | Output
Tokenization  | Splits text into chunks        | Raw Text       | Token IDs (Integers)
Vectorization | Represents tokens numerically  | Token IDs      | Sparse Vectors
Embedding     | Encodes semantic relationships | Sparse Vectors | Dense Vectors (Floats)
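One way to read the sparse-to-dense step in the summary, assuming the sparse vectors are one-hot encodings of the token IDs (an interpretation, not something the summary states), is that multiplying a one-hot vector by the embedding matrix gives the same result as looking up a row. The matrix values and function names below are hypothetical.

```python
VOCAB_SIZE = 4

# Hypothetical embedding matrix: one row per token ID, three dimensions each.
embedding_matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(token_id, size=VOCAB_SIZE):
    """Sparse vector: all zeros except a 1.0 at the token's index."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]

def dense_from_sparse(sparse):
    """Multiply a one-hot row vector by the embedding matrix."""
    dim = len(embedding_matrix[0])
    return [
        sum(sparse[r] * embedding_matrix[r][c] for r in range(len(sparse)))
        for c in range(dim)
    ]

token_id = 2
sparse = one_hot(token_id)
dense = dense_from_sparse(sparse)

print(sparse)  # [0.0, 0.0, 1.0, 0.0]
print(dense)   # [0.7, 0.8, 0.9]
```

Because the product only ever selects one row, implementations generally skip the multiplication and index the matrix directly by token ID.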
Dimensionality reduction is an optional post-embedding step, applied after tokens have been converted into high-dimensional vectors.
It compresses large vectors (e.g., 1536 dimensions) into smaller ones (e.g., 384 dimensions) to speed up search and reduce memory use while retaining most of the semantic meaning.
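A very small sketch of the idea, assuming PCA as the reduction technique: the code below finds the single direction of greatest variance with power iteration and projects 2D points onto it. The data and function names are hypothetical, and a real pipeline would use a numerical library and keep many components rather than one.

```python
import math

def mean_center(rows):
    """Subtract the per-column mean so the data is centered at the origin."""
    dim = len(rows[0])
    means = [sum(r[c] for r in rows) / len(rows) for c in range(dim)]
    return [[r[c] - means[c] for c in range(dim)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, dim = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]

def top_component(cov, iterations=100):
    """Power iteration: repeatedly apply cov and renormalize the vector."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(len(vec))) for i in range(len(cov))]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    return vec

# Hypothetical 2D points lying exactly along the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]

centered = mean_center(data)
component = top_component(covariance(centered))

# Project each 2D point onto the top component: 2 dimensions become 1.
reduced = [sum(x * w for x, w in zip(row, component)) for row in centered]

print(component)
print(reduced)
```

Because these points lie on a single line, one number per point preserves all the distances between them. With real embeddings some information is lost, and the number of retained dimensions controls that trade-off.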