Long Input Size (Gemini)
LLMs primarily deal with long inputs by employing a context window and several advanced techniques to manage the computational and memory challenges this presents.
The context window is the maximum amount of text (measured in tokens) the model can consider or "remember" at any one time to generate a coherent response.
1. Context Window Expansion
Modern LLMs have significantly expanded context windows, moving from a few thousand tokens to hundreds of thousands or even over a million tokens in some models.
This large working memory allows the model to process long documents, extensive codebases, or long-running conversations in a single request, which is often the most direct and effective approach.
Tokens: Text is broken down into smaller units called tokens (which can be words, sub-words, or characters). The context window is measured by this token count.
Benefits: With a large context window, the model can retain the full context, leading to more accurate summarization, question-answering, and generation of coherent long-form content.
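To make the token count concrete, here is a minimal sketch using the open-source tiktoken tokenizer. The exact tokenizer, and therefore the exact count, is model-specific, so this is illustrative rather than authoritative.

```python
# Minimal sketch: counting how many tokens a piece of text occupies.
# Assumes the third-party `tiktoken` package; real models each use their
# own tokenizer, so counts vary by model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Large language models read text as tokens, not characters."
tokens = enc.encode(text)

print(len(tokens))          # how much of the context window this text consumes
print(enc.decode(tokens))   # decoding recovers the original text
```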
2. Architectural and Algorithmic Optimizations
The original Transformer architecture, which powers LLMs, has a computational bottleneck because its self-attention mechanism scales quadratically (O(n²)) with the input sequence length (n). To overcome this, researchers have developed several techniques:
Efficient Attention Mechanisms:
Flash Attention: A technique that significantly reduces memory usage and speeds up computation by minimizing redundant memory access during the attention calculation.
Sliding Window Attention (e.g., Longformer): Limits each token to attend only to its nearby tokens within a fixed-size window, reducing the complexity from quadratic to linear (O(n)). Some global tokens are often included to capture long-range dependencies.
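As an illustration of the sliding-window idea (not any particular model's implementation), the sketch below builds a Longformer-style attention mask in which each token sees only a fixed-size neighbourhood plus a few global tokens; the number of allowed pairs grows linearly with sequence length.

```python
import numpy as np

def sliding_window_mask(seq_len, window, global_tokens=()):
    """Boolean mask: True where attention is allowed.

    Each token attends only to positions within `window` of itself; the
    designated global tokens attend to, and are attended by, everything."""
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window   # local band
    for g in global_tokens:
        mask[g, :] = True
        mask[:, g] = True
    return mask

print(sliding_window_mask(seq_len=8, window=2, global_tokens=(0,)).astype(int))
```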
Positional Encoding Techniques: Positional encodings tell the model about the order of tokens, which is crucial since the self-attention mechanism processes all tokens in parallel.
Rotary Position Embedding (RoPE) and Attention with Linear Biases (ALiBi): These methods improve the model's ability to extrapolate to sequences longer than those seen during training and handle long-range dependencies more efficiently than traditional absolute positional encodings.
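The sketch below shows the core idea of RoPE in NumPy, assuming the common "rotate-half" formulation: pairs of feature dimensions are rotated by an angle proportional to the token's position, so relative offsets show up directly in query-key dot products. It simplifies what production implementations do.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))        # one frequency per dimension pair
    angles = np.arange(seq_len)[:, None] * freqs[None, :]   # angle grows with position
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

queries = np.random.randn(16, 8)     # toy query vectors for 16 positions
rotated = rope(queries)              # same shape, position now encoded in the values
```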
Key-Value (KV) Caching Optimization: During inference (text generation), the model reuses the "key" and "value" vectors for previously processed tokens. Optimizing this storage (KV cache) is vital for efficient long-sequence processing.
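The toy decode loop below illustrates the caching pattern: each new token's key and value are computed once, appended to the cache, and every later step attends over the accumulated cache instead of recomputing attention inputs for the whole prefix. Random vectors stand in for the model's real projections.

```python
import numpy as np

def attend(q, K, V):
    """Single query attending over all cached keys/values (one head)."""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

dim = 8
K_cache, V_cache = [], []                 # grows by one entry per generated token

for step in range(5):                     # toy autoregressive decoding loop
    q = np.random.randn(dim)              # query for the newest token
    k, v = np.random.randn(dim), np.random.randn(dim)
    K_cache.append(k)                     # cache this token's key/value once...
    V_cache.append(v)
    out = attend(q, np.stack(K_cache), np.stack(V_cache))   # ...and reuse all earlier ones
```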
3. External Memory and Retrieval
When inputs are too long to fit even in a very large context window, or when dealing with vast external knowledge, LLMs are augmented with external memory systems.
Retrieval-Augmented Generation (RAG): This is a very common technique where the long input (a document, a knowledge base) is broken into smaller chunks and stored in a searchable database (a vector database).
When a query comes in, the system retrieves only the small, most relevant chunks from the database.
These few chunks are then fed to the LLM as part of the prompt (the context) to generate the response.
This efficiently sidesteps the context window limitation by only giving the model the information it needs, rather than the entire document.
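A minimal end-to-end sketch of this retrieve-then-prompt flow is shown below. The word-hashing "embedding" is a deliberately crude stand-in for a real embedding model, and the in-memory array plays the role of the vector database.

```python
import numpy as np

def embed(text, dim=64):
    """Toy stand-in for a real embedding model: hash words into a unit vector."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

chunks = [
    "The warranty covers parts and labour for two years.",
    "Returns are accepted within 30 days with a receipt.",
    "The device supports USB-C charging at up to 65 W.",
]
index = np.stack([embed(c) for c in chunks])   # the "vector database"

query = "How long is the warranty?"
scores = index @ embed(query)                  # cosine similarity (vectors are unit-norm)
top_k = np.argsort(scores)[::-1][:2]           # retrieve only the most relevant chunks

context = "\n".join(chunks[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```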
4. Prompt-Level Strategies
For users working within the constraints of a finite context window, simple but effective strategies can be used.
Summarization/Compression: If a conversation is getting too long, previous turns can be automatically summarized and the summary is fed as context instead of the full transcript.
Truncation/Filtering: In long conversations, the oldest or least relevant parts of the input may be dropped to keep the total token count under the limit.
Placement of Information: Research has shown that LLMs sometimes exhibit a "lost in the middle" problem, where performance drops for key information placed in the center of a very long input. Users can optimize performance by placing the most important information or the core question near the beginning or end of the long context.
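One simple mitigation, sketched below under the assumption that chunks are already ranked by relevance, is to reorder the context so the strongest material sits at the edges of the prompt and the question is restated at the end (similar in spirit to "long-context reorder" utilities).

```python
def build_prompt(question, ranked_chunks):
    """Place the most relevant chunks at the start and end of the prompt,
    pushing weaker material toward the middle, and restate the task last."""
    edges_first = ranked_chunks[0::2] + ranked_chunks[1::2][::-1]
    context = "\n\n".join(edges_first)
    return (
        f"Question: {question}\n\n"
        f"Context:\n{context}\n\n"
        f"Using the context above, answer the question: {question}"
    )
```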
Large Language Models (LLMs) deal with long inputs using a combination of expanded context windows and sophisticated context management strategies to overcome their inherent token limits.
Here are the key strategies, including Rolling Context, Retrieval-Augmented Generation (RAG), and the role of LangChain:
Expanded Context Windows
Some modern LLMs are designed with significantly larger context windows (up to hundreds of thousands or even millions of tokens). This is the simplest strategy: the model takes a much longer prompt (including a large document and the user's query) and processes it directly.
Pros: Simplicity, high potential for comprehensive understanding of the entire document, and on-the-fly reasoning.
Cons: Higher computational cost and latency (since the model processes the entire long input), and performance can sometimes degrade when key information is buried deep within the context (the "lost in the middle" effect, commonly probed with "needle in a haystack" tests).
Rolling Context
Rolling context, often employed in conversational agents, is a technique used to maintain the memory of a long conversation that exceeds the model's context window. Instead of passing the entire history with every new turn, the system:
Summarizes or compresses older parts of the conversation into a concise summary.
Passes this summary along with the most recent user messages as the input context for the next response.
The model uses the summary and the new messages to generate a coherent reply.
This ensures the LLM retains key themes and facts from the dialogue without running out of tokens or incurring the cost of processing a continually growing, massive history.
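A minimal sketch of this rolling pattern is below. The `llm_summarize` callable is hypothetical (it stands for any call that asks an LLM to compress text), and the whitespace token count is a crude stand-in for a real tokenizer.

```python
MAX_HISTORY_TOKENS = 3000   # assumed budget; real limits depend on the model

def count_tokens(text):
    """Crude approximation; use the model's own tokenizer in practice."""
    return len(text.split())

def roll_context(summary, turns, llm_summarize):
    """Fold the oldest turns into the running summary until the history fits."""
    while len(turns) > 2 and count_tokens(summary + " ".join(turns)) > MAX_HISTORY_TOKENS:
        oldest = turns.pop(0)
        summary = llm_summarize(
            f"Existing summary:\n{summary}\n\nFold in this turn:\n{oldest}"
        )
    return summary, turns   # summary + recent turns become the next prompt's context
```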
Retrieval-Augmented Generation (RAG)
RAG is a powerful and efficient architectural strategy to handle inputs that are too long or contain information external to the LLM's training data. Instead of feeding the entire document to the model, RAG follows a two-phase process:
Retrieval: The long input document (or a vast knowledge base) is first broken into smaller, meaningful chunks. These chunks are converted into numerical representations called embeddings and stored in a searchable database (like a vector store). When a user poses a question, the system uses the query's embedding to search and retrieve only the top-k most relevant chunks from the database.
Generation: These few, highly relevant chunks are then injected into the LLM's context window along with the original user query. The LLM uses this focused context to generate a response, effectively augmenting its knowledge with the specific, retrieved information.
Pros: Cost-efficient (fewer tokens are processed), up-to-date knowledge (as the external database can be updated constantly), and explainability (the sources/chunks used can often be cited).
Cons: Performance heavily relies on the quality of the chunking strategy and the retriever's accuracy—if the retriever misses the key information, the LLM can't answer correctly.
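Because chunking quality matters so much, here is a minimal character-based splitter with overlap, so sentences that straddle a boundary still appear whole in at least one chunk. Production splitters (LangChain's RecursiveCharacterTextSplitter, for example) additionally prefer paragraph and sentence boundaries.

```python
def split_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# Example: a 1200-character document becomes three overlapping chunks.
doc = "x" * 1200
print([len(c) for c in split_with_overlap(doc)])   # [500, 500, 300]
```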
The Role of LangChain
LangChain is a popular framework designed to simplify the development of applications powered by LLMs. It acts as an orchestration layer that ties together the various components needed for advanced strategies like RAG and complex conversational flows.
LangChain provides abstractions and tools that directly address the challenges of long inputs:
Document Loaders & Text Splitters: It offers modules to load data from various sources (files, databases, websites) and split large documents into optimized, smaller chunks for RAG.
Retrieval: It integrates seamlessly with various vector stores and provides different retrieval methods (like vector search or keyword search) to fetch the most relevant context.
Chains and Memory: For conversational applications, LangChain provides memory modules (including summarization memory for rolling context) that manage the chat history, ensuring only relevant or compressed context is passed to the LLM for each new turn.
Prompt Engineering: It helps structure the final prompt that combines the user query, system instructions, and the retrieved or summarized context, ensuring the LLM is given the optimal input for generation.
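A compact sketch of how these pieces fit together is shown below. It assumes the langchain-text-splitters, langchain-community, langchain-openai, and faiss-cpu packages are installed and an OpenAI API key is configured; module paths and class names move between LangChain releases, so treat it as illustrative rather than canonical.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

with open("handbook.txt") as f:                     # hypothetical long document
    document = f.read()

# Split the document into overlapping chunks sized for retrieval.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(document)

# Embed the chunks and index them in a local FAISS vector store.
vector_store = FAISS.from_texts(chunks, embedding=OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

# Retrieve only the most relevant chunks and assemble the final prompt.
question = "What is the refund policy?"
docs = retriever.invoke(question)
context = "\n\n".join(d.page_content for d in docs)
prompt = f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
```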
How do LLMs deal with long inputs? Describe LLM strategies including rolling context, RAG, and LangChain. Include strategies for LLMs that allow expanded input of code.
Large Language Models (LLMs) have a fundamental limitation known as the context window or context length, which is the maximum number of tokens (words, sub-words, or code elements) they can process in a single request. When an input exceeds this limit, the model cannot see the entire input, leading to a loss of context, coherence, and accuracy.
To address this, especially for long documents, conversations, or large codebases, several strategies have been developed, broadly categorized as architectural strategies (expanding the window itself) and input management strategies (working within it).
Architectural Strategies: Expanding the Context Window
These techniques directly modify the LLM's architecture or training to increase the context window size, though they often come with increased computational cost. Recent advancements have resulted in models with context windows reaching hundreds of thousands or even millions of tokens.
Longer Context Windows: Newer, highly capable models are being trained or fine-tuned with methods like Rotary Position Embedding (RoPE) or Attention with Linear Biases (ALiBi) to natively support much larger context lengths (e.g., up to 200k or 1M+ tokens). This allows the LLM to process entire documents or large sections of code at once.
Attention Optimizations: Techniques like Sparse Attention or Sliding Window Attention reduce the computational complexity (which is typically quadratic, O(n²), with respect to sequence length n) by making the model attend to only a subset of relevant tokens, making longer inputs computationally feasible.
Input Management Strategies
These strategies involve preprocessing the input or managing the conversation history outside of the core LLM call to keep the input within the model's limit while retaining essential information.
Rolling Context / Context Truncation and Summarization
This strategy is primarily used in long-running conversational applications where the conversation history constantly grows.
How it works:
The system maintains a full history of the conversation.
Before each new LLM call, the history is checked. If the token count exceeds the model's maximum input limit, the history is compressed.
Truncation: The oldest messages are simply cut off (dropped) to make room for the new input.
Summarization: An LLM itself is used to generate a concise summary of the older, less relevant parts of the conversation. This summary is then inserted into the prompt instead of the original text, effectively preserving the gist of the past conversation while saving tokens.
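The truncation branch is straightforward to sketch; the summarization branch would replace the dropped messages with an LLM-written summary instead. The message dicts and the always-kept system prompt are assumptions about how the application stores its history.

```python
def truncate_history(messages, max_tokens, count_tokens):
    """Drop the oldest non-system messages until the history fits the budget.

    `messages` are dicts like {"role": "user", "content": "..."}; the first
    message is assumed to be the system prompt and is always kept.
    `count_tokens` should be the target model's own tokenizer."""
    system, rest = messages[:1], messages[1:]

    def total():
        return sum(count_tokens(m["content"]) for m in system + rest)

    while rest and total() > max_tokens:
        rest.pop(0)          # the oldest turn is discarded first
    return system + rest
```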
Retrieval Augmented Generation (RAG)
RAG is a highly effective strategy for grounding LLMs in external knowledge—like internal documents, databases, or large codebases—without needing to fit all of it into the context window.
How it works:
Indexing: The external data (documents, code files) is broken into small, semantically meaningful chunks. These chunks are converted into numerical representations called embeddings and stored in a searchable database called a Vector Store.
Retrieval: When a user asks a question, the question is also converted into an embedding. The system searches the Vector Store to find the top K chunks of data whose embeddings are most similar to the question's embedding. This means only the most relevant pieces of information are retrieved.
Augmentation: These retrieved, highly-relevant chunks are then inserted into the LLM's prompt, along with the user's question.
Generation: The LLM uses this injected context to formulate a relevant and grounded answer.
Benefit: RAG bypasses the context window limitation by dynamically providing only the necessary context, rather than the entire corpus.
LangChain for Context Management
LangChain is a popular framework designed to orchestrate LLM workflows, and it includes robust abstractions for managing context, particularly for RAG and agentic flows.
Memory Abstractions: LangChain offers various Memory classes (e.g., ConversationBufferMemory, VectorStoreRetrieverMemory) that abstract the process of managing conversation history (rolling context) or integrating long-term memory (RAG); see the sketch after this list.
Chains and Agents: It allows developers to build complex applications called "Chains" and "Agents" that automatically handle context management steps:
A RAG Chain automatically chunks documents, handles the vector store lookup, and injects the context into the final prompt.
Agents use tools and can decide when to use a tool to retrieve information (RAG) or manage their internal scratchpad (a form of rolling context for multi-step reasoning).
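The sketch below shows the memory abstraction mentioned above in its simplest form. ConversationBufferMemory lives in the legacy langchain.memory module (newer releases steer toward LangGraph persistence instead), so the import path may differ depending on the installed version.

```python
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages=True)

# Each completed turn is written into memory...
memory.save_context({"input": "My name is Ada."}, {"output": "Nice to meet you, Ada."})
memory.save_context({"input": "What's my name?"}, {"output": "Your name is Ada."})

# ...and replayed into the next prompt. A summarizing variant
# (ConversationSummaryMemory) compresses this history instead of storing it verbatim.
print(memory.load_memory_variables({})["history"])
```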
Strategies for Expanded Code Input
Code context presents a unique challenge because a small change in one file can depend on logic across dozens of others. Strategies combine external retrieval with specific architectural awareness.
RAG for Codebases: This is the primary method. Instead of documents, the code is chunked (often by file, function, or class). A query about a bug or feature triggers RAG, which retrieves relevant functions or files across the entire repository to be included in the LLM's context.
Tools like Code Agents use this RAG-based context to analyze the codebase and propose multi-file changes.
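For Python code, the standard library's ast module is enough to sketch function- and class-level chunking, keeping the metadata needed to cite where a retrieved snippet came from. Real code indexers handle many languages and nested definitions, so this is only the core idea; the file name is hypothetical.

```python
import ast

def chunk_python_source(source, path):
    """Split one Python file into top-level function/class chunks for a code RAG index."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "path": path,                                # where to cite the snippet from
                "name": node.name,
                "start_line": node.lineno,
                "code": ast.get_source_segment(source, node),
            })
    return chunks

with open("parser.py") as f:                                 # hypothetical repository file
    for chunk in chunk_python_source(f.read(), "parser.py"):
        print(chunk["path"], chunk["name"], chunk["start_line"])
```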
File-Based/Directory Context: Tools (often built with frameworks like LangChain) can be configured to automatically pull in the contents of the current file or even all files within a specific directory into the context, especially for code-focused models.
Context Engineering for Agents: Code-focused agents use dynamic context management to:
Selectively Truncate/Summarize: Focus on the most recent code changes or only the stack trace when debugging.
Scratchpads: Maintain a running log of decisions, file changes, and thoughts to keep the reasoning process in the context, even if the surrounding code is too long.
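A scratchpad can be as simple as the sketch below: an append-only log whose most recent entries are rendered into each new prompt, so the agent's reasoning trail persists even when the surrounding code cannot all fit. The class and budget are illustrative, not a specific tool's API.

```python
class Scratchpad:
    """Append-only log of an agent's observations and decisions."""

    def __init__(self, max_entries=20):
        self.entries = []
        self.max_entries = max_entries      # keep only the freshest notes in the prompt

    def note(self, text):
        self.entries.append(text)

    def render(self):
        recent = self.entries[-self.max_entries:]
        return "\n".join(f"- {entry}" for entry in recent)

pad = Scratchpad()
pad.note("Reproduced the bug: test_parser fails on empty input.")
pad.note("Edited parser.py to guard against empty token lists.")
prompt_section = pad.render()               # injected into the agent's next prompt
```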
Model-Specific Optimizations: Dedicated code models are often designed or fine-tuned with extremely large context windows (e.g., 1M tokens) specifically to ingest and reason over vast amounts of code simultaneously, minimizing the need for complex external RAG lookups for smaller codebases.