(Gemini)
This tab, Vectors, covers how vectors are used in LLMs. It is part of the RAG series, which starts and is described here: Retrieval-Augmented Generation (RAG)
The "Retrieval" step is the engine that makes RAG work. It doesn't just search for keywords like a traditional search; it uses something called Vector Embeddings.
Vectorization (The "Math" Step):
Before you even ask a question, all the documents in your library are broken into small chunks and converted into long lists of numbers called vectors. These numbers represent the semantic meaning of the text.
Example: In vector space, the word "King" is mathematically closer to "Queen" than it is to "Apple."
The Query:
When you submit your prompt, the system converts your question into a vector using the same mathematical model.
Similarity Search:
The system then looks through its Vector Database to find the document chunks whose "numbers" most closely match the "numbers" of your question. This is often done using a calculation called Cosine Similarity, which measures the angle between two vectors.
Context Window:
The top-ranking results (the most relevant snippets) are pulled out and placed into the Context Window—the short-term memory the LLM uses to write its final response.
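The four steps above can be sketched in a few lines of Python. Everything here is a toy: the 3-dimensional "embeddings" are hand-written for illustration, whereas a real system would get vectors with hundreds of dimensions from a trained embedding model.

```python
import math

# Hypothetical 3-dimensional "embeddings" -- real models produce
# hundreds of dimensions via a trained encoder, not hand-picked numbers.
chunks = {
    "Cats are small domestic felines.":     [0.9, 0.1, 0.0],
    "Ovens are used for baking bread.":     [0.0, 0.8, 0.3],
    "Stock prices fell sharply on Monday.": [0.1, 0.0, 0.9],
}

def cosine_similarity(a, b):
    """Angle-based similarity: dot product divided by the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vector, top_k=1):
    """Rank every chunk by cosine similarity to the query vector."""
    ranked = sorted(chunks.items(),
                    key=lambda item: cosine_similarity(query_vector, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

# Pretend the question "Tell me about felines" embedded to this vector:
query = [0.8, 0.2, 0.1]
context = retrieve(query)   # top-ranking chunk(s) go into the Context Window
print(context[0])           # the cats/felines chunk wins
```

The `retrieve` results would then be pasted into the prompt as context before the LLM generates its answer.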
Why is this better than a simple Keyword Search?
A keyword search might fail if you ask about "felines" but the document only uses the word "cats." Because RAG uses vectors, it "knows" that cats and felines are the same concept, so it will still retrieve the correct information.
What happens next?
Once those documents are retrieved, the LLM performs Generation. It treats the retrieved text as the "source of truth" and is instructed to prioritize it over its own (possibly outdated) training data if the two conflict.
Would you like to see a simple Python code example of how a vector search is performed, or should we talk about how to evaluate if a RAG system is actually working well?
1. The Code: A Simple Vector Search
This example uses a common library called sentence-transformers to turn text into math (vectors) and then find the closest match.
2. The Evaluation: The "RAG Triad"
Simply building a RAG system isn't enough; you have to make sure it isn't "hallucinating" or ignoring the context. We evaluate this using three metrics:
Context Relevance: Did the retrieval step actually find the right information? If you ask about "Apples" and the system retrieves "Oranges," the retrieval part is broken.
Groundedness (Faithfulness): Is the answer derived only from the retrieved context? If the document says "The price is $10" but the AI says "The price is $15," it has failed the groundedness test.
Answer Relevance: Does the final output actually answer the user's question? Sometimes an AI gets distracted by the extra context and forgets to be helpful.
Why RAG Fails (and how to fix it)
Bad Chunking: If your document "chunks" are too small, they lose meaning. If they are too large, they include too much "noise."
Weak Embedding Models: Some models are better at technical jargon than others.
Vector "Drift": Sometimes the math says two things are similar when they are actually opposites (e.g., "I love this" vs. "I hate this").
Would you like to explore how Chunking Strategies work, or should we look at the different types of Vector Databases like Pinecone or Weaviate?
2. Chunking Strategies: The Art of "Slicing" Data
"Chunking" is the process of breaking a long document (like a 100-page PDF) into smaller pieces so the LLM can process them. If your chunks are too small, you lose context. If they are too large, you get "noise."
A. Fixed-Size Chunking
This is the simplest method. You decide on a set number of characters or tokens (e.g., 500 tokens).
Pros: Easy to implement; consistent.
Cons: Often cuts a sentence right in the middle, destroying the meaning.
B. Recursive Character Splitting
The industry standard (often used via LangChain). It tries to split at natural boundaries like paragraphs (\n\n), then newlines (\n), then spaces.
Pros: Keeps related ideas together.
Cons: Still might separate a header from its following paragraph.
C. Semantic Chunking
The most advanced method. It uses an AI model to look at the "meaning" of sentences. It only breaks a chunk when it detects that the topic has changed.
Pros: Highest accuracy for retrieval.
Cons: More computationally expensive (requires more processing time).
D. Context-Aware (Sliding Window)
You create chunks of 500 tokens but include a 100-token overlap from the previous chunk.
Why? This ensures that even if an important fact is split at the boundary, the context is preserved in both chunks.
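Methods A and D above can be sketched together: a fixed-size chunker with a sliding-window overlap. The tokens here are just words for simplicity; a real pipeline would chunk tokenizer output, and 500/100 are the example sizes from the text.

```python
def chunk_text(tokens, chunk_size=500, overlap=100):
    """Fixed-size chunking with a sliding-window overlap.

    `tokens` is any list (words here; real pipelines use tokenizer output).
    Each chunk repeats the last `overlap` tokens of the previous one, so a
    fact that straddles a boundary survives intact in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

words = ("the quick brown fox jumps over the lazy dog " * 120).split()  # 1080 words
chunks = chunk_text(words, chunk_size=500, overlap=100)
# Chunks start at positions 0, 400, 800, so neighbours share 100 words.
```

Recursive and semantic chunking replace the fixed `step` with paragraph boundaries or topic-change detection, but the overlap trick combines with either.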
Does the LLM itself run a vector search when it answers? It is a common point of confusion, but the short answer is no. A standard LLM (like GPT-4 or a base Gemini model) does not perform a vector search as part of its internal "thinking" process to generate a response.
To understand why, it helps to look at the two different "memories" an AI can use.
1. Internal Memory (Parametric Knowledge)
When an LLM generates a response normally, it relies on Parametric Knowledge. This is the information it "learned" during its massive training phase.
It uses math to predict the next most likely word in a sequence.
It does not look up a database. It simply "remembers" patterns from trillions of sentences it has seen before.
2. External Memory (The Vector Search)
A Vector Search is a tool we build around the LLM to give it extra information. This is the "Retrieval" part of RAG.
The vector search happens in a separate database (like Pinecone or Weaviate).
It finds relevant documents and hands them to the LLM as part of the prompt.
The "Hidden" Vector Search
While the LLM doesn't search a database to find facts, it does use vectors internally to understand your question.
Every word you type is converted into a vector (a list of numbers) so the LLM can process it. It uses these vectors to understand the relationships between your words. This happens inside the Transformer architecture.
The Key Distinction:
Internal Vectors: Used by the LLM to understand the meaning and logic of your sentence.
Vector Search: An external process used to find specific facts in a database to help the LLM be more accurate.
Why should Cosine Similarity, which measures the angle between two vectors, have anything to do with semantic search?
It feels counterintuitive at first: why would the angle between two lists of numbers tell us whether a sentence about "baking" is related to a sentence about "ovens"?
The secret lies in Vector Space. When we turn text into vectors (embeddings), we aren't just assigning random numbers. We are placing that text into a map with thousands of dimensions, where each dimension represents a specific "feature" or "concept."
How Angle Equals Meaning
Imagine a simple 2D map where one axis represents "Fruitiness" and the other represents "Sweetness."
The word "Apple" would have high values for both.
The word "Banana" would also have high values for both.
The word "Hammer" would have near-zero values for both.
In this map, the arrows (vectors) for "Apple" and "Banana" would point in roughly the same direction. The angle between them is very small. Because they point toward the same "concepts," we can mathematically conclude they are semantically related.
Why use Angle (Cosine) instead of Distance (Euclidean)?
You might wonder: "Why not just measure the distance between the tips of the arrows?"
The reason is Magnitude (Length). In text processing, a long document might mention the word "Python" 50 times, while a short tweet mentions it once.
The length of the vector for the long document will be much larger because of the frequency.
However, both vectors point in the same direction (the "Programming" direction).
Cosine Similarity ignores how "long" the vectors are and only looks at the direction they are pointing. This allows a RAG system to find a 10-page whitepaper that matches your 5-word question, even though their "sizes" are completely different.
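Both points, angle equals meaning, and cosine ignores magnitude, can be checked on the toy 2-D "fruitiness/sweetness" map above. The numbers are invented for illustration.

```python
import math

def cosine(a, b):
    """Direction-only similarity: dot product over the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean(a, b):
    """Distance between the tips of the arrows (sensitive to length)."""
    return math.hypot(*(x - y for x, y in zip(a, b)))

# Toy 2-D map: axis 0 = "fruitiness", axis 1 = "sweetness".
apple  = [0.9, 0.8]
banana = [0.8, 0.9]
hammer = [0.05, 0.1]

cosine(apple, banana)   # close to 1: same direction, related meaning
cosine(apple, hammer)   # much lower: different direction

# Magnitude does not change the direction. Scale "apple" up 50x,
# as if a long document mentioned it far more often than a short one:
long_doc = [x * 50 for x in apple]
cosine(apple, long_doc)     # still 1.0 -- the angle is unchanged
euclidean(apple, long_doc)  # large -- plain distance is fooled by length
```

This is exactly why a 10-page whitepaper can match a 5-word question: their vectors differ wildly in length but point the same way.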
The Logic Summary
Direction = Topic: If two vectors point the same way, they share the same context.
Angle = Similarity:
$0^\circ$ angle ($\text{Cosine} = 1$): the meanings are identical.
$90^\circ$ angle ($\text{Cosine} = 0$): the meanings are unrelated.
$180^\circ$ angle ($\text{Cosine} = -1$): the meanings point in opposite directions.
To clarify a very common point of confusion: The LLM does not use "Vector Search" (retrieval from a database) to predict the next word.
However, it does use Vector Math and Linear Algebra internally to do so. Here is the distinction between the two:
1. The LLM's Internal Math (Not a Search)
When an LLM is predicting the next word, it is performing a massive series of matrix multiplications.
Input to Vector: Your input words are converted into vectors (embeddings).
Processing: These vectors pass through the "layers" of the model. Each layer adjusts the numbers based on the context of the other words (this is called Attention).
The Probability Head: At the very end, the model has a final vector that represents "the meaning of the next word."
Softmax: The model compares this "ideal" vector against its entire vocabulary (e.g., 50,000 words). It assigns a probability to every single word.
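The softmax step can be shown on a toy 4-word vocabulary. The raw scores (logits) are invented; in a real model they come from comparing the final vector against every vocabulary entry.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numeric stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final scores for a tiny 4-word vocabulary after the
# model has processed "I deposited money at the ...":
vocab  = ["bank", "river", "oven", "dog"]
logits = [5.1, 2.0, 0.3, -1.2]

probs = softmax(logits)
prediction = vocab[probs.index(max(probs))]   # "bank" gets the highest probability
```

A real model does the same thing over ~50,000 entries instead of 4, then samples or picks from those probabilities.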
2. The Vector Search (An External Tool)
A Vector Search runs in a separate software system, a vector database such as Pinecone or Weaviate. The LLM only "uses" it if a developer has set up a RAG pipeline.
In a RAG setup:
The Vector Search happens first. It finds a relevant document (e.g., a Wikipedia paragraph).
That document is turned back into text and pasted into the prompt.
Then the LLM uses its internal math to predict the next word based on that new text.
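The order of operations above can be sketched as three lines of glue code. Both `retrieve` and `generate` are hypothetical stubs; in a real pipeline the first would query a vector database and the second would call an LLM API.

```python
# Hypothetical stand-ins: `retrieve` would query a vector database and
# `generate` would call an LLM. Both are stubbed to show only the order
# of operations in a RAG pipeline.
def retrieve(question):
    return "Wikipedia says the Eiffel Tower is 330 m tall."   # stub

def generate(prompt):
    return f"(LLM answer based on: {prompt!r})"               # stub

def rag_answer(question):
    # 1. The vector search happens FIRST, outside the model.
    context = retrieve(question)
    # 2. The retrieved text is pasted into the prompt...
    prompt = f"Context: {context}\n\nQuestion: {question}"
    # 3. ...and only then does the LLM's internal math predict the answer.
    return generate(prompt)

answer = rag_answer("How tall is the Eiffel Tower?")
```

The LLM never touches the database itself; it only ever sees the text that step 2 pasted into its prompt.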
The confusion happens because both processes use "Vectors":
The "Lookup" Analogy
Internal Prediction: Like a human speaking fluently. You don't "search" your brain for the next word; it just flows based on your internal "wiring" (parameters).
Vector Search: Like a human pausing to look at a reference book. You are explicitly searching an external source to get more info before you continue speaking.
In an LLM, the "meaning" of a word isn't fixed. It changes based on the words around it. This process is called Self-Attention, and it’s how the model transforms a static vector into a "context-aware" vector.
1. The Static Vector (The Starting Point)
When you type the word "bank," the model pulls a pre-defined list of numbers for that word. At this stage, the vector for "bank" is the same whether you are talking about a river bank or a savings bank.
2. The Attention Mechanism (The "Context" Filter)
To figure out which "bank" you mean, the model looks at every other word in your sentence. It uses three internal vectors for every word:
Query (Q): "What am I looking for?"
Key (K): "What information do I contain?"
Value (V): "What information should I contribute to the final meaning?"
3. The Math of "Focus"
The model performs a Dot Product (a cousin of Cosine Similarity) between the Query of "bank" and the Keys of all other words:
Sentence A: "I went to the bank to deposit money."
The "bank" Query hits the "money" Key and gets a high score.
The model "attends" to the word money and pulls the "financial" aspects into the vector for bank.
Sentence B: "I sat on the bank of the river."
The "bank" Query hits the "river" Key and gets a high score.
The model "attends" to the word river and pulls the "geographic" aspects into the vector.
4. The Result: A New Vector
By the time the math is done, the vector for "bank" has been mathematically "tilted" or updated. It is no longer just "bank"; it is now a vector that specifically represents "financial institution" or "edge of water."
Why this matters for the "Next Word"
Once the model has these context-aware vectors, it can accurately predict what comes next.
If the vector for "bank" is tilted toward river, it predicts "water" or "flow."
If it’s tilted toward money, it predicts "interest" or "account."
Summary: The Layered Process
Input: Static vectors (from a lookup table).
Attention: Vectors "talk" to each other to find context.
Update: Vectors change their values based on those conversations.
Prediction: The final, highly specific vector is used to guess the next word.
tbd