AI Token Pricing Explained: Input, Output & Cached

AI models are billed per token, not per request, at three separate rates: input tokens (what you send), output tokens (what the model writes back), and cached input tokens (a reused prompt prefix). Output is typically 3 to 5 times more expensive than input, and cached input is often about 90% cheaper than fresh input. A token is roughly 4 characters or 0.75 of an English word, so 1,000 tokens is about 750 words. Prices are quoted per million tokens. As an illustrative example, a model at $2 per million input tokens and $10 per million output tokens would cost $120 to process 35 million input and 5 million output tokens.

If you have ever looked at an AI provider's pricing page and seen a line like "$2.00 per 1M input tokens and $10.00 per 1M output tokens" and quietly moved on, this guide is the missing manual. Understanding it is the difference between a predictable cloud bill and a surprise invoice. Below we break down what a token is, why input and output are priced differently, how prompt caching quietly saves teams the most money, and how to estimate and compare the real cost of a task before you run it. The specific dollar figures we cite are representative published rates as of 2026 and are used only to illustrate the mechanics; always confirm current numbers on the provider's own pricing page.

What a token is, and why you pay per token

A token is the unit an AI model reads and writes. It is a chunk of text, usually a sub-word fragment rather than a whole word. In English, one token averages about 4 characters or roughly 0.75 of a word, so 1,000 tokens is approximately 750 words and a dense page of text is around 500 to 800 tokens. Common words are often a single token; rare words, code, punctuation, and other languages split into more tokens. Every model has its own tokenizer, so the exact count varies slightly between providers.
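The rule of thumb above can be turned into a quick estimator. The sketch below is only an approximation built on the 4-characters and 0.75-words averages quoted here; the function names are our own illustration, and a provider's tokenizer is the only source of an exact count.

```python
import math

CHARS_PER_TOKEN = 4      # rough English average quoted above
WORDS_PER_TOKEN = 0.75   # rough English average quoted above


def estimate_tokens_from_words(word_count: int) -> int:
    """Approximate token count from an English word count."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_tokens_from_text(text: str) -> int:
    """Approximate token count from raw character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


print(estimate_tokens_from_words(750))         # 750 words -> about 1,000 tokens
print(estimate_tokens_from_text("a" * 4000))   # 4,000 characters -> about 1,000 tokens
```

Code, rare words, and non-English text usually split into more tokens than these averages suggest, so treat the result as a floor rather than a budget.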

Both halves of a conversation are counted. The input (also called the prompt) is everything you send: your system instructions, the user question, any documents or retrieved context, and the running conversation history. The output (also called the completion) is everything the model generates in reply. You are billed for both, at different rates.

Why per token and not per request? Because the cost of running a model scales with the amount of text processed and generated, not with the number of API calls. A one-line question and a request that stuffs a 50,000-word document into the prompt are both "one request," but the second does far more work. Per-token billing ties your cost to the actual compute you consume. It also means two things drive your bill that a flat per-request price would hide: how long your prompts are, and how much the model writes back.

The dual rate: why input and output cost different amounts

Figure: The dual rate. Input and output tokens are billed separately, with output commonly 3 to 5 times more expensive.

The single most important thing to internalize about AI pricing is that input and output are billed at different rates, and output is more expensive. Across major providers, output tokens typically cost 3 to 5 times more than input tokens. When a price is written as "$2 per million input tokens and $10 per million output tokens," output is 5 times the input rate.

The reason is technical. A model reads your entire prompt in parallel in a single forward pass, which is relatively efficient. It then generates the reply one token at a time, and each new token requires another forward pass that attends to the whole growing context. Producing text is simply more compute-intensive per token than reading it, and the price reflects that asymmetry.

The practical consequence is that verbose outputs are where budgets quietly bleed. Doubling the length of your prompt raises input cost; asking the model to write twice as much raises the pricier output cost. A task that reads a lot but answers briefly (classification, extraction, routing) is cheap. A task that writes a lot (long-form drafting, code generation, detailed reasoning) is where the output rate dominates. We come back to this when we compare models, because two models with the same input price can differ sharply once output is in the mix.

A worked example: what a real task actually costs

Abstract rates are hard to feel, so let us price a concrete job. Suppose you run a support desk and want to summarize and tag 10,000 tickets. We will use an illustrative model priced at $2 per million input tokens and $10 per million output tokens (the same shape as Claude Sonnet 5's published 2026 rates).

The setup

Each of the 10,000 requests looks like this:

Fixed system prompt and instructions: 1,500 input tokens (identical on every request)
The ticket text: 2,000 input tokens (different every time)
Total input per request: 3,500 tokens
The generated summary plus tags: 500 output tokens

The math, step by step

First, total the tokens across the whole batch. Input is 3,500 tokens times 10,000 requests, which is 35,000,000 tokens (35 million). Output is 500 tokens times 10,000 requests, which is 5,000,000 tokens (5 million).

Now apply the rates. Rates are quoted per million tokens, so divide the token count by one million and multiply by the rate:

Input: 35 million tokens at $2 per million tokens is 35 times $2, which is $70.
Output: 5 million tokens at $10 per million tokens is 5 times $10, which is $50.
Total: $70 plus $50 is $120 for the full batch of 10,000 tickets.

That is $0.012 per ticket. Notice the split: even though there are 7 times as many input tokens as output tokens (35 million versus 5 million), output still accounts for $50 of the $120 bill, about 42%, because its rate is 5 times higher. That is the dual rate at work.

Token type	Rate (per 1M tokens)	Tokens in this batch	Cost
Input (uncached)	$2	35 million	$70
Output	$10	5 million	$50
Total		40 million	$120
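The same arithmetic can be written as a small function. This is a minimal sketch using the illustrative $2 and $10 rates from the example; the function and parameter names are our own, not any provider's API.

```python
def batch_cost(input_tokens_per_request: int,
               output_tokens_per_request: int,
               requests: int,
               input_rate_per_million: float,
               output_rate_per_million: float) -> dict:
    """Price a batch when rates are quoted in dollars per million tokens."""
    total_input = input_tokens_per_request * requests
    total_output = output_tokens_per_request * requests
    # Divide by one million because the rates are per million tokens.
    input_cost = total_input / 1_000_000 * input_rate_per_million
    output_cost = total_output / 1_000_000 * output_rate_per_million
    total = input_cost + output_cost
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total": total,
        "per_request": total / requests,
    }


# The support-desk batch: 3,500 input and 500 output tokens, 10,000 tickets.
result = batch_cost(3_500, 500, 10_000, 2.00, 10.00)
for label, value in result.items():
    print(f"{label}: ${value:,.3f}")
```

Swapping in your own token counts and a provider's current rates gives a first-pass estimate before you run anything.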

Prompt caching: the biggest discount most people miss

Figure: Prompt caching lets a repeated prompt prefix be read at a fraction of the normal input rate, often around 10% of it.

Prompt caching is a discount on repeated input. When many of your requests begin with the same block of text (a long system prompt, a fixed set of instructions, a reference document, or few-shot examples), providers can store that prefix and let subsequent requests reuse it instead of re-reading it from scratch. Cached input reads are commonly billed at about 10% of the normal input rate, which is roughly a 90% discount on those tokens.

There is usually a small catch: writing a prefix to cache for the first time often carries a modest premium (commonly around 25% above the base input rate, provider-dependent), and cached entries expire after a time-to-live, frequently around 5 minutes, sometimes longer for a higher write premium. Caching pays off when the same prefix is reused many times before it expires, which is exactly the pattern in agents with a big fixed system prompt, retrieval systems that share a knowledge base, multi-turn chats, and high-volume batch jobs.
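To see how quickly the write premium is repaid, the sketch below compares a cached prefix with an uncached one. It assumes one cache write followed by cache hits that all land inside the time-to-live, and it uses the illustrative 25% premium and 10% read rate from the text; real providers may differ on both.

```python
def cached_prefix_cost(uses: int,
                       write_premium: float = 0.25,
                       read_fraction: float = 0.10) -> float:
    """Cost of a cached prefix over `uses` requests, in units of one uncached read.

    Assumes one cache write, then (uses - 1) cache hits inside the
    time-to-live. The defaults are the illustrative figures from the text.
    """
    if uses < 1:
        raise ValueError("uses must be at least 1")
    return (1 + write_premium) + read_fraction * (uses - 1)


for uses in (1, 2, 10, 10_000):
    cached = cached_prefix_cost(uses)
    print(f"{uses:>6} uses: cached {cached:,.2f} vs uncached {uses:,.2f}")
```

Under these assumed figures, caching costs more for a single use (1.25 versus 1.00) and is already cheaper on the second use (1.35 versus 2.00). If the entry expires between requests, each miss pays the write premium again, which this sketch does not model.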

How caching changes the worked example

Return to the support-desk batch. Of each request's 3,500 input tokens, the 1,500-token system prompt is identical every time, while the 2,000-token ticket text is unique. Only the fixed prefix can be cached.

Across 10,000 requests, the system prompt accounts for 1,500 times 10,000, which is 15,000,000 tokens (15 million). Priced normally at $2 per million tokens, that is $30. If instead those tokens are read from cache at about $0.20 per million tokens (roughly 10% of the input rate), the same 15 million tokens cost about $3. Assuming the shared prefix stays warm in cache and treating the one-time write premium as negligible because a single prefix is reused thousands of times, the system-prompt portion drops from $30 to about $3, a saving of about $27.

The rest of the bill is unchanged: the unique ticket text is still 20 million input tokens at $2 per million tokens ($40), and output is still 5 million tokens at $10 per million tokens ($50). The new total is about $3 plus $40 plus $50, which is about $93, down from $120. That is roughly a 22% cut on the whole job, achieved by caching a single block of repeated text. On workloads with a large fixed context (long system prompts, big retrieved documents), the savings are often far greater.
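The cached and uncached totals can be reproduced with a few lines. This is a sketch under the same simplifications as the text: the prefix stays warm for the whole batch and the one-time write premium is ignored. The constant names are ours.

```python
REQUESTS = 10_000
PREFIX_TOKENS = 1_500     # fixed system prompt, the only cacheable part
UNIQUE_TOKENS = 2_000     # ticket text, different on every request
OUTPUT_TOKENS = 500

INPUT_RATE = 2.00         # dollars per million tokens (illustrative)
OUTPUT_RATE = 10.00
CACHED_READ_RATE = 0.20   # about 10% of the input rate


def dollars(tokens: int, rate_per_million: float) -> float:
    """Convert a token count to dollars at a per-million rate."""
    return tokens / 1_000_000 * rate_per_million


def batch_total(prefix_rate: float) -> float:
    """Total batch cost when the shared prefix is billed at `prefix_rate`."""
    prefix = dollars(PREFIX_TOKENS * REQUESTS, prefix_rate)
    unique = dollars(UNIQUE_TOKENS * REQUESTS, INPUT_RATE)
    output = dollars(OUTPUT_TOKENS * REQUESTS, OUTPUT_RATE)
    return prefix + unique + output


uncached = batch_total(INPUT_RATE)
cached = batch_total(CACHED_READ_RATE)
saving = uncached - cached
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, "
      f"saving ${saving:.2f} ({saving / uncached:.1%})")
```

Raising PREFIX_TOKENS relative to UNIQUE_TOKENS shows why workloads with a large fixed context save proportionally more.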

The other pricing tiers: introductory, long-context, and batch

Beyond the input, output, and cached rates, a provider's price for the same model can shift along several other axes. Knowing them prevents nasty surprises.

Introductory versus standard pricing

New models sometimes launch at promotional, introductory rates to drive adoption, then move to standard pricing later. Off-peak or discounted windows also exist for some providers. When you build a cost estimate on a launch price, note that it may not be permanent, and re-check before you commit a high-volume workload. This is one reason we treat every dollar figure in this guide as illustrative rather than fixed.

Long-context surcharges

Many models keep one price up to a context threshold and charge more above it. A common pattern is a higher input and output rate once a single request exceeds around 200,000 tokens, because very long contexts are more expensive to serve. If your use case routinely sends huge prompts (entire codebases, long transcripts, book-length documents), check whether you cross a long-context tier, because the per-million rate above the threshold can be meaningfully higher than the headline number.
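A tiered price is easy to model once you know the threshold and both rate pairs. In the sketch below the 200,000-token threshold follows the pattern described above, but the long-context rates are hypothetical numbers chosen for illustration, and so is the rule that the whole request switches to the higher rates once the input crosses the threshold. Check how your provider actually applies its tier.

```python
LONG_CONTEXT_THRESHOLD = 200_000   # tokens, the "around 200,000" pattern from the text

# Dollars per million tokens. The long-context pair is HYPOTHETICAL.
STANDARD_RATES = {"input": 2.00, "output": 10.00}
LONG_CONTEXT_RATES = {"input": 4.00, "output": 15.00}


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Price one request, billing the whole request at the long-context
    rates when its input exceeds the threshold (an assumed billing rule)."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        rates = LONG_CONTEXT_RATES
    else:
        rates = STANDARD_RATES
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000


print(f"150k-token prompt: ${request_cost(150_000, 1_000):.3f}")
print(f"250k-token prompt: ${request_cost(250_000, 1_000):.3f}")
```

With these made-up rates, a prompt about 67% longer costs more than three times as much, which is the kind of jump worth checking for before sending whole codebases or transcripts.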

Batch pricing

For work that does not need an immediate answer, most major providers offer an asynchronous batch mode at roughly a 50% discount. You submit a large job, the provider processes it within a window (often up to 24 hours), and you pay about half the standard rate. Applied to our support-desk batch, the same $120 of work run through a batch API at a 50% discount would cost about $60. Batch mode stacks nicely with caching for large offline jobs, though exact stacking rules vary by provider.
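As a quick sketch, the batch discount is a single multiplier on the standard-rate cost. The 50% figure is the rough discount quoted above. The second call assumes the discount applies on top of the cached total from the earlier example, which is an assumption to verify with your provider rather than a guarantee.

```python
def batch_price(standard_cost: float, discount: float = 0.50) -> float:
    """Apply an asynchronous-batch discount to a cost computed at standard rates."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return standard_cost * (1.0 - discount)


print(batch_price(120.00))   # support-desk batch at standard rates
print(batch_price(93.00))    # same batch after caching, IF the discounts stack
```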

What actually drives your bill

Figure: Four levers set your AI bill: how long your prompts are, how much the model writes, how many calls you make, and which model tier you pick.

Four levers determine your invoice. Pull any one and the cost moves:

Prompt length (input tokens). System prompts, retrieved context, conversation history, and attached documents all count as input on every call. Long, unpruned histories and oversized retrieved chunks are a common source of silent cost. Trimming context and caching fixed prefixes are the two biggest input savings.
Output length (output tokens). Because output is the pricier rate, controlling how much the model writes is often the single highest-leverage change. Set a sensible maximum output length, ask for concise formats, and avoid requesting long restatements you do not need.
Volume (number of calls). Cost scales linearly with request count. Ten thousand tickets cost ten times as much as one thousand. Deduplicating, batching, and caching repeated work all reduce effective volume.
Model tier. Frontier models cost many times more than smaller, faster models. Routing easy requests to a cheaper model and reserving the expensive tier for hard cases is the most powerful cost lever of all, and the numbers below show why.
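To make the model-tier lever concrete, the sketch below blends two of the illustrative rate pairs from this guide's comparison table. The routing shares are hypothetical, and the code assumes easy and hard requests are the same average size, so token volume splits in the same proportion as request count.

```python
# Illustrative rates in dollars per million tokens (input, output),
# taken from this guide's comparison table.
CHEAP = (0.14, 0.28)      # DeepSeek V4
PREMIUM = (2.00, 10.00)   # Claude Sonnet 5


def workload_cost(input_millions: float, output_millions: float,
                  rates: tuple[float, float]) -> float:
    """Cost of a workload measured in millions of tokens."""
    input_rate, output_rate = rates
    return input_millions * input_rate + output_millions * output_rate


def routed_cost(input_millions: float, output_millions: float,
                cheap_share: float) -> float:
    """Cost when `cheap_share` of the token volume goes to the cheap model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    cheap = workload_cost(input_millions, output_millions, CHEAP)
    premium = workload_cost(input_millions, output_millions, PREMIUM)
    return cheap_share * cheap + (1.0 - cheap_share) * premium


for share in (0.0, 0.5, 0.8):   # hypothetical routing shares
    print(f"{share:.0%} routed to cheap tier: ${routed_cost(35, 5, share):.2f}")
```

With the hypothetical 80% share, the 35-million-input, 5-million-output workload comes to about $29 instead of $120. Whether the cheaper model clears your quality bar on those requests is a separate question the arithmetic cannot answer.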

How to compare two models honestly

Figure: Estimate any task: multiply input tokens by the input rate, add output tokens times the output rate, then multiply by your call volume.

The mistake almost everyone makes is comparing models on the headline input price. That number alone tells you very little, because your real cost depends on your mix of input and output. The honest way to compare is to price your actual workload on each model with the same simple formula:

Cost per request = (input tokens times input rate) + (output tokens times output rate), with each rate expressed per token (the quoted per-million price divided by 1,000,000). Then multiply by your number of requests.

Take the exact same support-desk workload from earlier (35 million input tokens and 5 million output tokens, before any caching) and price it across a spread of real 2026 models. The differences are dramatic:

Model	Input (per 1M)	Output (per 1M)	Cost for this task
DeepSeek V4	$0.14	$0.28	$6.30
Kimi K2.7	$0.95	$4	$53.25
Claude Sonnet 5	$2	$10	$120
Gemini 3.1 Pro	$2	$12	$130
Claude Opus 4.8	$5	$25	$300
GPT-5.5	$5	$30	$325

Two lessons jump out. First, the spread is enormous: the same job ranges from $6.30 on DeepSeek V4 to $325 on GPT-5.5, a factor of more than 50, before any quality judgment enters the picture. Second, the input price can be identical while the total differs. Claude Sonnet 5 and Gemini 3.1 Pro both charge $2 per million input tokens, yet this task costs $120 on one and $130 on the other purely because of a $2 gap in the output rate. Judge cost on the blend, never the input line alone.

The mix matters even more on output-heavy work. Flip the workload to long-form generation with 5 million input tokens and 40 million output tokens, and the ranking stretches further: the same task runs roughly $11.90 on DeepSeek V4, about $410 on Claude Sonnet 5, and about $1,225 on GPT-5.5. When the model writes a lot, the output rate is the whole story, and cheap-output models pull far ahead. If you want to see how these models stack up on capability as well as price, our head-to-head breakdowns such as Claude Sonnet 5 vs GPT-5.5, Claude Sonnet 5 vs DeepSeek V4, Kimi K2.7 vs DeepSeek V4, and GPT-5.5 vs DeepSeek V4 put the numbers side by side.
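Both comparisons can be reproduced by running the formula over the rate table. The sketch below uses the illustrative rates listed above and prices the two workloads discussed in this section; the dictionary and function names are ours.

```python
# Illustrative rates from the table above: (input, output) in dollars per million tokens.
RATES = {
    "DeepSeek V4": (0.14, 0.28),
    "Kimi K2.7": (0.95, 4.00),
    "Claude Sonnet 5": (2.00, 10.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}


def task_cost(input_millions: float, output_millions: float,
              input_rate: float, output_rate: float) -> float:
    """Workload cost, with token volumes given in millions of tokens."""
    return input_millions * input_rate + output_millions * output_rate


print(f"{'Model':<16} {'35M in / 5M out':>16} {'5M in / 40M out':>16}")
for name, (input_rate, output_rate) in RATES.items():
    read_heavy = task_cost(35, 5, input_rate, output_rate)
    write_heavy = task_cost(5, 40, input_rate, output_rate)
    print(f"{name:<16} {read_heavy:>16,.2f} {write_heavy:>16,.2f}")
```

Replacing the two workload shapes with your own input and output volumes turns this into a like-for-like comparison for your actual traffic.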

The hidden line item: reasoning tokens

One more factor can quietly multiply an output bill. Many current models can "think" before answering, generating internal reasoning tokens that are usually billed as output even though you never see most of them. On a reasoning-heavy prompt, a model might produce thousands of hidden reasoning tokens for a short visible answer, so the output charge can dwarf what the final text suggests. When you estimate cost for a reasoning model, budget for the thinking tokens, not just the reply you read, and check whether the provider lets you cap or disable extended reasoning for simple tasks.
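A short sketch shows how much hidden reasoning can inflate the output line. The token counts here are hypothetical, picked only to illustrate the effect, and the code assumes reasoning tokens are billed at the ordinary output rate.

```python
def output_cost(visible_tokens: int, reasoning_tokens: int,
                output_rate_per_million: float) -> float:
    """Output charge when hidden reasoning tokens are billed as output."""
    billed_tokens = visible_tokens + reasoning_tokens
    return billed_tokens / 1_000_000 * output_rate_per_million


OUTPUT_RATE = 10.00   # dollars per million output tokens (illustrative)

# Hypothetical request: a 200-token answer preceded by 3,000 reasoning tokens.
visible_only = output_cost(200, 0, OUTPUT_RATE)
with_reasoning = output_cost(200, 3_000, OUTPUT_RATE)
print(f"visible answer only: ${visible_only:.4f}")
print(f"including reasoning: ${with_reasoning:.4f}")
print(f"multiplier: {with_reasoning / visible_only:.0f}x")
```

In this made-up case the billed output is 16 times what the visible reply alone would suggest, so an estimate built from answer length can be off by an order of magnitude.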

The bottom line

AI pricing looks intimidating but reduces to a handful of rules. You pay per token, not per request. Input and output are billed separately, and output usually costs 3 to 5 times more, so verbose replies are where money leaks. Cached input is often about 90% cheaper than fresh input, which makes prompt caching the highest-return optimization for any workload with a repeated prefix. Batch mode trims roughly half off non-urgent jobs, and very long contexts can cross a pricier tier. To compare models, price your real input-output mix with the simple formula above rather than trusting the headline input number, and remember that reasoning tokens count as output.

Put those together and cost control becomes concrete: shorten prompts, cap and tighten outputs, cache fixed context, batch what can wait, and route each request to the cheapest model that clears your quality bar. For per-model rates and capability notes, our individual reviews of Claude Sonnet 5, GPT-5.5, Gemini 3.1 Pro, DeepSeek V4, Kimi K2.7, and Claude Opus 4.8 each list the current published pricing, and every dollar figure in this explainer is a 2026-era illustration you should verify against the provider's live pricing page before you build on it.

Frequently Asked Questions

What is a token in AI model pricing?

A token is the unit of text an AI model reads and writes, usually a sub-word fragment. In English, one token averages about 4 characters or roughly 0.75 of a word, so 1,000 tokens is about 750 words. Both your input (the prompt) and the model's output (the reply) are measured in tokens, and prices are quoted per million tokens.

Why do AI models charge per token instead of per request?

Because cost scales with the amount of text processed, not the number of API calls. A one-line question and a request containing a 50,000-word document are both a single request, but the second does far more work. Per-token billing ties your cost to the compute you actually use, which is why prompt length and output length both drive your bill.

Why are output tokens more expensive than input tokens?

A model reads your whole prompt in one parallel pass, which is efficient, but it generates the reply one token at a time, with each new token requiring another pass over the growing context. Producing text is more compute-intensive per token than reading it, so output typically costs 3 to 5 times more than input. In a "$2 input, $10 output per million tokens" model, output is 5 times the input rate.

What does "$2 per million input tokens and $10 per million output tokens" mean?

It means you pay $2 for every one million tokens you send and $10 for every one million tokens the model writes back. To price a task, divide your token count by one million and multiply by the rate. For example, 35 million input tokens cost 35 times $2 ($70), and 5 million output tokens cost 5 times $10 ($50), for a total of $120.

What is prompt caching and how much does it save?

Prompt caching stores a repeated prompt prefix, such as a fixed system prompt or a reference document, so later requests reuse it instead of re-paying full input price. Cached reads are commonly billed at about 10% of the input rate, roughly a 90% discount on those tokens. In our worked example, caching a shared 1,500-token system prompt across 10,000 requests cut that portion from about $30 to about $3 and took the whole batch from $120 to about $93.

What is the difference between input and cached input pricing?

Fresh input is text the model reads for the first time, billed at the full input rate. Cached input is a previously seen prefix read from the provider's cache, typically billed at about 10% of the input rate. Writing a prefix to cache the first time often carries a small premium (around 25% above the base rate), and cached entries expire after a time-to-live, frequently around 5 minutes, so caching pays off when the same prefix is reused many times.

Do long-context requests cost more?

Often yes. Many models keep one rate up to a context threshold and charge a higher input and output rate above it, commonly once a single request exceeds around 200,000 tokens, because very long contexts are more expensive to serve. If you routinely send huge prompts such as entire codebases or long transcripts, check whether you cross a long-context tier where the per-million rate is higher than the headline number.

How much does the batch API discount save?

For work that does not need an immediate answer, most major providers offer an asynchronous batch mode at roughly a 50% discount, processing your job within a window that is often up to 24 hours. Applied to our support-desk example, the same $120 of work would cost about $60 in batch mode. Batch pricing can often be combined with prompt caching for large offline jobs, though exact stacking rules vary by provider.

How do I estimate the cost of a task before running it?

Use the formula: cost per request equals input tokens times the input rate, plus output tokens times the output rate, then multiply by your number of requests. Estimate tokens by multiplying your English word count by about 1.33, or use the provider's tokenizer for accuracy. Remember to include system prompts, retrieved context, and conversation history as input, and any hidden reasoning tokens as output.

How do I compare the real cost of two models like Claude Sonnet 5 and GPT-5.5?

Never compare on the input price alone; price your actual input-output mix on each model. On a workload of 35 million input and 5 million output tokens, Claude Sonnet 5 (at $2 input and $10 output per million tokens) costs about $120, while GPT-5.5 (at $5 input and $30 output per million tokens) costs about $325. Two models can even share an input price yet differ on total: Claude Sonnet 5 and Gemini 3.1 Pro both charge $2 input, but the task costs $120 versus $130 because of their output rates.

Are reasoning or thinking tokens billed separately?

Most models that "think" before answering generate internal reasoning tokens that are billed as output, even though you usually never see them. On a reasoning-heavy prompt, a model can produce thousands of hidden reasoning tokens for a short visible answer, so the output charge can be much larger than the final text implies. When budgeting for a reasoning model, account for those thinking tokens and check whether you can cap or disable extended reasoning on simple tasks.
