The moment AI goes from training to thinking — and why speed matters.
// The Concept
Inference is when a trained model generates output from new input. Training teaches the model — it processes trillions of tokens over weeks or months using thousands of GPUs, adjusting billions of parameters through backpropagation. Inference uses what it learned — it takes a new input, runs it through the frozen model weights, and produces a response. Every time you send a prompt to ChatGPT or Claude, that is inference. Every Google search that triggers an AI Overview is inference. Every Perplexity.ai answer is inference.
The distinction matters because the economics of inference — cost per token, latency per request, throughput per GPU — determine what is commercially viable in AI. Training GPT-4 reportedly cost over $100 million. But inference is the ongoing cost that scales with every user, every query, every page processed. OpenAI serves hundreds of millions of inference requests per day. The cost of processing your content through a transformer is not zero — and that cost influences which content AI systems choose to process, retrieve, and cite.
Inference is also where all the interesting optimizations happen. Training is a brute-force process: throw more data and more compute at the problem. Inference optimization is surgical: reduce the computation per token, cache intermediate results, predict multiple tokens at once, quantize weights from 32-bit to 8-bit or 4-bit precision. These optimizations determine whether AI search can serve a billion queries a day at a cost that sustains a business. They also determine the computational budget available for processing your content.
The AI industry is in an inference-dominated phase. Training a frontier model happens once. Inference happens billions of times. The companies winning the AI race are not just training better models — they are serving inference more efficiently. This is why inference speed, cost, and architecture matter for anyone whose content needs to be processed by these systems.
// How It Works
During inference, input tokens are embedded, passed through every transformer layer, and the final layer produces a probability distribution over the entire vocabulary. The model samples from this distribution — controlled by a temperature parameter — to select the next token. Then it appends that token to the sequence and repeats. Each new token requires incorporating all previous tokens through the attention mechanism, which is why generation gets progressively slower as the output grows.
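The sampling loop above can be sketched in a few lines of Python. This is a toy illustration with a three-word vocabulary and hand-picked logits, not any real model's API; the key move is dividing the logits by the temperature before the softmax:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=None):
    """Temperature-scaled softmax sampling over a toy vocabulary.
    `logits` maps token -> raw score. Lower temperature sharpens the
    distribution (more deterministic); higher temperature flattens it."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())   # subtract max to keep exp() stable
    exps = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    r, cum = rng.random(), 0.0
    for tok, p in probs.items():   # inverse-CDF sampling
        cum += p
        if r < cum:
            return tok, probs
    return tok, probs              # floating-point fallback

# As temperature -> 0, the highest-logit token dominates:
logits = {"the": 3.0, "a": 1.0, "zebra": -2.0}
tok, probs = sample_next_token(logits, temperature=0.1)
```

In a real model the loop then appends the sampled token to the sequence and runs the forward pass again, which is the autoregressive repetition described above.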
KV-caching is the most important inference optimization. During the first pass (called "prefill"), the model computes key and value vectors for every token in the input. These are cached. On subsequent generation steps, only the new token needs to go through the full computation — the cached keys and values from all previous positions are reused. Without KV-caching, generating a 500-token response from a 2000-token input would require re-processing all 2000+ tokens at every single generation step. With KV-caching, each step processes only the single new token against the cached state.
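The savings can be made concrete with a toy cost model that counts how many token positions must run through the full transformer stack. It ignores the per-step attention over the cached state, so it understates the residual cost, but the shape of the comparison holds:

```python
def attention_ops(prompt_len, gen_len, kv_cache=True):
    """Count token positions that pass through the full model stack,
    as a rough proxy for inference cost (toy model, not real FLOPs)."""
    if kv_cache:
        # Prefill computes K/V for the whole prompt once; each decode
        # step then runs only the single new token against the cache.
        return prompt_len + gen_len
    # Without a cache, every step reprocesses the entire sequence so far.
    return sum(prompt_len + t for t in range(gen_len))

# The 2,000-token input / 500-token response example from above:
with_cache = attention_ops(2000, 500, kv_cache=True)    # 2,500 positions
without = attention_ops(2000, 500, kv_cache=False)      # 1,124,750 positions
```

Under this toy accounting the cache cuts the work by roughly 450x, which is why every production serving stack uses it.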
Speculative decoding is a technique where a smaller, faster "draft" model generates several candidate tokens ahead, and the larger model verifies them in a single batch. If the draft model guessed correctly (which it often does for common patterns), multiple tokens are accepted at once. This can increase throughput by 2-3x for routine text while maintaining the quality of the larger model for surprising or complex passages.
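A toy sketch of one speculative round, with deterministic stand-in functions for the draft and target models (both invented here for illustration). In a real system the verification loop below would be a single batched forward pass of the large model, which is where the speedup comes from:

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One round of speculative decoding (toy sketch).
    `draft_next` / `target_next` are deterministic next-token functions
    standing in for the small draft model and the large target model."""
    # 1. Draft model cheaply proposes k tokens ahead.
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)
    # 2. Target verifies the proposals (one batched pass in practice);
    #    accept the longest agreeing prefix plus its own correction.
    accepted, ctx = [], list(context)
    for tok in draft:
        t = target_next(ctx)
        if t != tok:
            accepted.append(t)   # target's token replaces the miss
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Draft agrees with the target on common words, misses the rare one:
target = lambda ctx: ["the", "cat", "sat", "on", "quantum"][len(ctx)]
draft  = lambda ctx: ["the", "cat", "sat", "on", "the"][len(ctx)]
out = speculative_step(target, draft, context=[], k=5)
```

Here five tokens are committed for the price of one target-model pass, and the final token still comes from the large model, preserving its quality on the "surprising" word.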
Mixture of Experts (MoE) architectures, used in models like Mixtral and reportedly GPT-4, activate only a subset of the model's parameters for each token. A routing network decides which "expert" sub-networks are most relevant. This means a model with 1.8 trillion total parameters might only use 280 billion parameters per token — dramatically reducing inference cost while maintaining the knowledge capacity of the full parameter count.
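The routing idea can be sketched with experts as plain functions and router scores supplied by hand (both invented for illustration); only the top-k experts run, the rest cost nothing for this token:

```python
def moe_layer(x, experts, router_scores, top_k=2):
    """Toy Mixture-of-Experts layer: only the top_k highest-scoring
    experts execute; their outputs are mixed by normalized router weight."""
    ranked = sorted(range(len(experts)),
                    key=lambda i: router_scores[i], reverse=True)[:top_k]
    total = sum(router_scores[i] for i in ranked)
    out = sum(router_scores[i] / total * experts[i](x) for i in ranked)
    return out, ranked

experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * 10]
y, active = moe_layer(5.0, experts, router_scores=[0.1, 0.6, 0.05, 0.3], top_k=2)
# Four experts' worth of capacity, two experts' worth of compute.
```

Scaled up, this is the mechanism that lets a model carry a huge total parameter count while touching only a fraction of it per token.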
// Why It Matters for Search
Inference economics determine which content AI systems will process. This is not a theoretical concern — it is an active engineering constraint at every AI company. When Google generates an AI Overview, it must retrieve relevant pages, process them through a transformer, and generate a coherent response. Each of those steps costs tokens. Each token costs compute. At billions of queries per day, these costs are measured in hundreds of millions of dollars annually.
Longer pages cost more tokens to process. A 5,000-word article requires roughly 7,000 tokens of input processing — significantly more expensive than a 1,500-word article at roughly 2,000 tokens. Complex pages with many distinct entities, multiple topic threads, and dense cross-references require more attention computation per token because the model must maintain more active relationships. AI companies make explicit economic decisions about what to crawl, how much to process, and what to cite, based on inference cost versus value.
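The word-to-token arithmetic above uses the common rule of thumb of roughly 1.33 English tokens per word; the exact ratio depends on the tokenizer and the text, so treat this as an estimate:

```python
def estimated_tokens(word_count, tokens_per_word=1.33):
    """Rough English token estimate; ~1.33 tokens/word is a rule of
    thumb, not a property of any specific tokenizer."""
    return round(word_count * tokens_per_word)

long_article = estimated_tokens(5000)   # ~6,650 tokens of input
short_article = estimated_tokens(1500)  # ~2,000 tokens of input
```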
Well-structured, efficiently tokenized content is literally cheaper to process. Clean HTML with clear header hierarchy, schema markup, and focused paragraphs requires less computation to understand than the same information buried in cluttered markup, interstitial ads, and rambling prose. The model spends fewer attention cycles disambiguating the content structure and more cycles encoding the actual information. This gives structured content an inherent efficiency advantage in AI systems that must balance quality against cost at massive scale.
Front-loading matters because of how attention budgets work in causal transformers. The first tokens in a sequence receive attention from every subsequent token. Tokens at position 100 can be attended to by positions 101 through N. Tokens at position 5,000 can only be attended to by positions 5,001 through N. Your most important entity signals, credentials, and topic declarations belong in the first 500 tokens — where they receive maximum attention exposure across the entire inference process.
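The arithmetic behind this is simple: in a causal model, a token's exposure is the number of later positions that can attend to it, so earlier positions win by construction.

```python
def attention_exposure(position, context_len):
    """In a causal transformer, a token at `position` can be attended
    to by every later position (position+1 .. context_len)."""
    return context_len - position

# Toy 8,000-token context (length chosen for illustration):
ctx = 8000
early = attention_exposure(100, ctx)    # visible to 7,900 later tokens
late = attention_exposure(5000, ctx)    # visible to only 3,000
```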
// In Practice
Respect the AI's computational budget. Every token on your page costs something to process during inference. This does not mean writing less — it means writing with purpose. Every paragraph should earn its token cost by adding genuine information, not repeating what was already stated. A 1,200-word page with zero filler creates a denser, more cost-efficient representation than a 3,000-word page where half the tokens are transitional fluff.
Front-load your most important content. The first 1,000 tokens of your page — roughly the first 750 words — are processed at full attention by every subsequent token in the model's context. Your entity credentials, primary topic declaration, unique value proposition, and key claims should all appear in this window. Do not bury your entity introduction on page 3 of a 5,000-word article. The inference economics punish late-loaded authority signals.
Use clear structure that reduces the inference cost of understanding your page. Headers, lists, and schema markup are not just formatting — they are computational shortcuts. A well-structured page with H2 headers that accurately summarize their sections lets the model build a page-level representation more efficiently. The headers serve as high-information-density anchor points that the attention mechanism can use to organize the entire page without having to process every paragraph at equal depth.
Consider the token efficiency of your markup. Excessive inline styles, deeply nested div structures, and verbose class names all consume tokens during inference. Clean, semantic HTML is not just a best practice for accessibility — it is an inference efficiency optimization. The model does not care about your CSS framework. It cares about the content tokens. Minimize the token overhead between your entity signals and the model's attention.
Schema markup in JSON-LD is particularly inference-efficient because it provides machine-readable entity declarations in a compact, structured format that requires minimal interpretation. The model can extract entity type, name, credentials, and relationships from JSON-LD with far fewer attention cycles than parsing the same information from natural language paragraphs. This is one reason why pages with robust schema consistently outperform pages without it in AI-driven systems — they are literally cheaper and faster to understand.
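As a sketch, here is a compact Article declaration built and serialized in Python; every field value below is invented for illustration, and the `separators` argument strips whitespace to keep the token footprint minimal:

```python
import json

# Hypothetical Article schema; all values are placeholders.
article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "What Is AI Inference?",
    "author": {
        "@type": "Person",
        "name": "Jane Doe",          # hypothetical author
        "jobTitle": "ML Engineer",   # credential as structured data
    },
    "about": ["AI inference", "KV-caching", "Mixture of Experts"],
}

# Compact serialization: no spaces after separators, no indentation.
json_ld = json.dumps(article_schema, separators=(",", ":"))
```

The resulting string drops into a `<script type="application/ld+json">` tag, giving the model entity type, name, and credentials in a handful of unambiguous tokens.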
// FAQ
What does it actually cost to run inference on a page?
Costs vary dramatically by model size and provider. As of early 2026, GPT-4o processes input at approximately $2.50 per million tokens and generates output at $10 per million tokens. Claude Opus runs approximately $15 per million input tokens and $75 per million output tokens. Smaller, faster models like Claude Haiku ($0.25/$1.25) and GPT-4o Mini ($0.15/$0.60) are 10-100x cheaper. These costs drive every architectural decision about what AI systems process. When your page is retrieved by an AI search engine, the inference cost of processing it is a real factor in whether and how thoroughly it gets analyzed.
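Plugging the per-million-token rates quoted above into the arithmetic, for a hypothetical request that feeds a ~7,000-token article in and gets a 300-token answer back:

```python
def page_cost_usd(input_tokens, output_tokens, in_price, out_price):
    """Cost of one request given per-million-token prices in USD."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical request: 7,000 input tokens, 300 output tokens.
gpt4o = page_cost_usd(7000, 300, 2.50, 10.00)   # $0.0205 per request
haiku = page_cost_usd(7000, 300, 0.25, 1.25)    # $0.002125 per request
```

Fractions of a cent per page sound negligible until they are multiplied by billions of queries per day, which is exactly the scale at which these decisions are made.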
Are inference costs coming down?
Yes, and rapidly. Inference costs have been dropping roughly 10x every 18 months, driven by three forces: hardware improvements (each new GPU generation delivers 2-3x more inference throughput), architectural optimizations (quantization, speculative decoding, MoE routing), and competitive market pressure (providers competing on price). This means AI systems will process more content over time — expanding what gets crawled, retrieved, and cited. But the relative advantage of structured, efficiently tokenized content persists regardless of absolute cost levels. At any price point, clear content is cheaper to process than cluttered content.
Join the Burstiness & Perplexity community for AI infrastructure discussions and optimization strategies.