Context Window

Definition

A Context Window is the maximum amount of text (measured in tokens) that a large language model can process in a single interaction. It determines how much information the model can "see" at once — including the system prompt, conversation history, uploaded documents, and the generated response. A larger context window allows the model to work with longer documents, maintain coherent multi-turn conversations, and reason over more information simultaneously. Context windows have expanded dramatically: from GPT-3's 4K tokens to Claude's 200K tokens and Gemini's 1M+ tokens in 2026. However, larger windows increase cost and latency, and models may still struggle with information buried in the middle of very long contexts (the "lost in the middle" problem).
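The budget described above covers everything in the request: system prompt, history, documents, and the response itself. A minimal sketch of that accounting, using the rough ~4 characters-per-token heuristic (real applications should count with the model's actual tokenizer; the function names and numbers here are illustrative, not from any vendor's API):

```python
# Sketch: checking whether a request fits inside a context window.
# The window must hold the entire input AND the generated output.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)

def fits_in_window(system_prompt: str, history: str, document: str,
                   max_output_tokens: int, context_window: int) -> bool:
    """True if input tokens plus the reserved output budget fit the window."""
    input_tokens = (estimate_tokens(system_prompt)
                    + estimate_tokens(history)
                    + estimate_tokens(document))
    return input_tokens + max_output_tokens <= context_window

# A ~150K-token document plus a 4K-token response budget fits in a
# 200K window but not in a 128K one.
doc = "x" * 600_000  # ~150K tokens under the heuristic
print(fits_in_window("You are helpful.", "", doc, 4_096, 200_000))  # True
print(fits_in_window("You are helpful.", "", doc, 4_096, 128_000))  # False
```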

How It Works

The context window is the maximum number of tokens an LLM can process in a single interaction—encompassing both the input (your prompt, system instructions, and any injected context) and the generated output. It functions as the model's working memory. Technically, the context window is determined by the positional encoding scheme used during training and the amount of GPU memory available for the key-value (KV) cache, which stores attention states for all tokens in the sequence. Standard transformer attention scales quadratically with sequence length (O(n^2) in compute and memory), which is why extending context windows is an active area of research. Techniques like RoPE (Rotary Position Embeddings) with YaRN scaling, Flash Attention (which reduces memory from O(n^2) to O(n) via tiled computation), sliding window attention, and KV-cache compression allow modern models to handle longer contexts. However, research consistently shows that models perform best on information near the beginning and end of the context, with degraded attention in the middle—a phenomenon known as "lost in the middle."
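The KV-cache memory pressure mentioned above grows linearly with sequence length, and it is easy to estimate: each layer stores one key and one value vector per KV head for every token in the context. A back-of-the-envelope sketch, using illustrative architecture numbers (roughly Llama-3-8B-like: 32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16) that are assumptions for this example, not figures from any model card:

```python
# Sketch: estimating KV-cache size for a long context.
# Per token, each layer stores one key and one value vector per KV head,
# hence the leading factor of 2.

def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Total KV-cache memory in bytes for a sequence of seq_len tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

# 128K tokens of context for this hypothetical model:
gib = kv_cache_bytes(131_072) / 2**30
print(f"{gib:.0f} GiB of KV cache")  # 16 GiB
```

This is why serving long contexts is expensive even with Flash Attention: the cache alone for one 128K-token request can rival the memory footprint of the model weights.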

Why It Matters

Context window size determines what you can actually do with an LLM in a single call. A 4K-token window holds only a few pages of text. A 128K window can ingest a small codebase or several long documents. A 1M window can process a full book. This directly shapes your architecture: with a small context window, you must rely on RAG to retrieve relevant chunks; with a large one, you might stuff entire documents in directly. But bigger isn't always better: cost scales linearly with token count, latency increases, and attention quality degrades. Understanding these trade-offs helps you design AI features that are both effective and cost-efficient, and knowing about "lost in the middle" helps you structure prompts by placing critical information near the beginning or end.
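The linear cost relationship is worth making concrete: doubling the input tokens doubles the input bill. A minimal sketch, where the per-million-token prices are made-up placeholders (not any vendor's actual pricing):

```python
# Sketch: linear API cost scaling with token count.
# Prices below are hypothetical placeholders, not real vendor rates.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float = 3.0,
                 output_price_per_m: float = 15.0) -> float:
    """Cost in dollars for one call at per-million-token rates."""
    return (input_tokens / 1e6) * input_price_per_m \
         + (output_tokens / 1e6) * output_price_per_m

# Stuffing a 100K-token document vs. retrieving a 4K-token RAG context,
# with a 1K-token response in both cases:
print(round(request_cost(100_000, 1_000), 4))  # 0.315
print(round(request_cost(4_000, 1_000), 4))    # 0.027
```

At scale, that roughly 10x per-request difference is often the deciding factor between context stuffing and retrieval.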

Real-World Examples

Anthropic's Claude offers up to 200K tokens in its standard API and even larger windows for select use cases. Google's Gemini 1.5 Pro supports 1 million tokens natively. OpenAI's GPT-4o provides 128K tokens. Open-source models like Llama 3.1 offer 128K tokens, while Mistral's models range from 32K to 128K. On ThePlanetTools.ai, context window size is a key evaluation criterion in our AI tool reviews—it determines whether a coding tool like Cursor can understand your entire project structure or only the file you're editing, which directly affects code suggestion quality.
