Tokenization

How AI actually reads your content — and it's not word by word.

// The Concept

Tokenization is the process of converting text into the discrete units — tokens — that language models actually process. This is the first transformation your content undergoes when any AI system reads it, and it shapes everything that follows. Models don't read words. They don't read characters. They read tokens — subword units that break language into statistically optimal chunks.

The word "unhappiness" might become three tokens: "un", "happi", "ness". The word "the" is almost certainly a single token — it's so common that it earned its own entry in the vocabulary. A URL like "https://burstinessandperplexity.com" might fragment into eight or more tokens, each one consuming a position in the context window. Your brand name, your technical terminology, your formatting choices — all of these are shaped by tokenization in ways that directly affect how AI systems process and represent your content.

Modern language models use vocabularies ranging from roughly 32,000 to 100,000 or more tokens. GPT-4 uses a Byte-Pair Encoding (BPE) vocabulary of about 100K tokens. Claude uses a similar approach. These vocabularies are built from training data through a statistical process that optimizes for compression — the most common character sequences become single tokens, while rare sequences get split into smaller pieces. The result is a vocabulary that can represent any text, but represents common text far more efficiently than rare text.

// How It Works

Byte-Pair Encoding starts with individual bytes (or characters) and iteratively merges the most frequent adjacent pairs. In the first pass, if "t" and "h" appear next to each other more frequently than any other pair in the training corpus, they get merged into a single token "th". In the next pass, maybe "th" and "e" merge into "the". This continues for tens of thousands of iterations, building up progressively longer token units from the bottom up.
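The merge loop described above can be sketched in a few lines of Python. This is a toy illustration on a character-level corpus with made-up word frequencies; real tokenizers start from raw bytes and run tens of thousands of merges over far larger corpora:

```python
# Toy BPE training: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical corpus: word -> frequency, each word split into characters.
corpus = {"the": 10, "then": 4, "there": 3, "hat": 2}
words = {tuple(w): f for w, f in corpus.items()}

for _ in range(3):  # a real tokenizer runs tens of thousands of merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    print("merging:", pair)
    words = merge_pair(words, pair)
```

On this corpus the first merge is ("t", "h"), then ("th", "e") — exactly the bottom-up pattern described above, with "the" emerging as a single symbol after two merges.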

// BPE tokenization in action
// Starting vocabulary: all individual bytes (256)
// Training: iteratively merge most frequent pairs

// Example: "unhappiness" tokenization
"unhappiness" → ["un", "happi", "ness"]   // 3 tokens

// Common words = fewer tokens
"the" → ["the"]            // 1 token
"because" → ["because"]    // 1 token
"SEO" → ["SE", "O"]        // 2 tokens

// Brand names and technical terms
"NovCog" → ["Nov", "Cog"]              // 2 tokens
"burstiness" → ["bur", "st", "iness"]  // 3 tokens
"perplexity" → ["perp", "lex", "ity"]  // 3 tokens

// URLs are token-expensive
"https://burstinessandperplexity.com" →
  ["https", "://", "bur", "st", "iness", "and", "perp", "lex", "ity", ".", "com"]  // 11 tokens

// Vocabulary size comparison
GPT-2    50,257 tokens
GPT-4   ~100,000 tokens   // cl100k_base
Claude  ~100,000 tokens   // similar BPE scheme
LLaMA    32,000 tokens    // SentencePiece BPE

The result of this process: common English words typically map to one or two tokens. Technical jargon and proper nouns often require three or more. Non-English text, code, and specialized notation tend to tokenize inefficiently — consuming more tokens for the same amount of semantic content. This isn't a design flaw. It's a direct consequence of the training data distribution: the tokenizer optimizes for the text it sees most often.

A subtle but important consequence: token boundaries affect model understanding. The model's internal representation is built token by token. When a word is a single token, the model processes it as an atomic unit with a single, learned embedding. When a word fragments across multiple tokens, the model must compose meaning across those fragments — which it does reasonably well, but not perfectly. Common words have stronger, more nuanced representations than rare words precisely because they're encoded as single, well-trained tokens.

Models like LLaMA use SentencePiece, a tokenizer that operates on raw text without language-specific pre-tokenization steps. And Google's Gemini uses a variant tuned to handle multilingual text more efficiently. But the core principle remains the same: statistical frequency determines token boundaries, and token boundaries determine how the model "sees" your text.

// Why It Matters for Search

Tokenization determines how AI "sees" your brand name, your technical terms, and your content structure. This has cascading effects on every downstream process — from how your content fits within context windows, to how efficiently your entity gets encoded in hidden states, to how much it costs AI companies to process your pages at scale.

Short, common brand names tokenize as single units, giving them stronger, more atomic representation in the model's vocabulary space. A brand name that maps to one token has a single, dense embedding vector that encodes everything the model learned about that brand during training. A brand name that fragments into four tokens requires the model to compose a representation from pieces — workable, but inherently noisier.

This is why brand naming matters for the AI era. "NovCog" likely tokenizes as two clean tokens — "Nov" and "Cog" — both of which carry meaningful semantic associations (novelty, cognition). "Novel Cognition Artificial Intelligence Solutions" tokenizes into seven or more tokens, diluting the brand signal across fragmented pieces. In an AI-first visibility strategy, short and semantically loaded beats long and descriptive.

Token count also directly determines API cost. Every AI company prices by token. When Perplexity.ai crawls and processes your page, the token count determines the computational cost. When ChatGPT's browsing feature reads your content, the token count determines how much of your page fits in its working memory alongside other sources. Clean, well-structured content — without excessive boilerplate, redundant navigation text, or unnecessarily verbose phrasing — is literally cheaper for AI to process. And when AI systems face budget constraints on how much content to evaluate, efficient tokenization means more of your substance gets processed.
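The cost relationship is simple arithmetic. A back-of-envelope sketch, using a purely hypothetical price of $0.01 per 1,000 input tokens and illustrative token counts (actual per-token prices vary by provider and model):

```python
# Back-of-envelope processing cost per page.
# PRICE_PER_1K_TOKENS is a hypothetical figure, not any provider's real price.
PRICE_PER_1K_TOKENS = 0.01  # USD per 1,000 input tokens, illustrative only

def processing_cost(token_count, price_per_1k=PRICE_PER_1K_TOKENS):
    """Cost in USD to process token_count tokens at the given rate."""
    return token_count * price_per_1k / 1000

clean_article = 2200   # ~1,500 words of body text (illustrative)
raw_html_page = 9000   # same page with nav, footer, scripts (illustrative)

print(f"clean text: ${processing_cost(clean_article):.4f}")
print(f"raw HTML:   ${processing_cost(raw_html_page):.4f}")
```

At any per-token price, the raw-HTML version of the page costs roughly four times as much to process as the clean text — which is exactly why boilerplate-heavy markup works against you at crawl scale.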

Whitespace and formatting consume tokens too. Every paragraph break, every bullet point marker, every heading tag contributes to the token count. This doesn't mean you should strip formatting — structure aids comprehension for both humans and AI. But it means that bloated HTML, excessive inline styles, and redundant markup all consume tokens that could otherwise carry semantic content.

// In Practice

Keep entity names short and distinctive. "NovCog" is likely two tokens. "Novel Cognition" is three or more. In AI-processed contexts where every token position matters, shorter entity names occupy less of the model's limited attention bandwidth while maintaining equal or stronger representation. When choosing brand names, domain names, or product names for the AI era, count the likely tokens.

Use standard terminology where possible. AI models have stronger representations for common tokens because those tokens appeared more frequently in training data. When you use well-established terms like "machine learning," "neural network," or "content optimization," the model processes them with high-fidelity representations. When you coin novel terminology, the model must compose understanding from subword fragments — which works, but introduces representational noise. Use established vocabulary as your foundation, and introduce novel terms only when they carry genuinely distinct meaning.

You can test your content's tokenization directly. OpenAI provides a free tokenizer tool (tiktoken), and HuggingFace hosts tokenizers for most major models. Run your key pages through these tools. Check how your brand name tokenizes. Check how your most important technical terms tokenize. Check the total token count of your pages. This data informs practical decisions about content structure, naming, and length.

For structured data: JSON-LD schema is generally token-efficient because it uses a compact, standardized syntax. The overhead of schema markup is modest — typically 200-400 tokens — but the entity signals it provides are disproportionately valuable because they're explicitly structured rather than requiring extraction from natural language. It's one of the best token-to-value ratios in content optimization.
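For illustration, a minimal JSON-LD block of the kind described above. The organization name and description are placeholders borrowed from this article's examples, not a recommended schema:

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "NovCog",
  "url": "https://burstinessandperplexity.com",
  "description": "Example organization entry; replace with your own entity details."
}
```

Even with the JSON punctuation, a block like this stays well within the 200-400 token overhead range while stating the entity relationships explicitly.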

How many tokens is a typical web page?

A 1,500-word page is roughly 2,000-2,500 tokens, well within any modern context window. A typical blog post of 800-1,200 words runs about 1,000-1,600 tokens. However, the raw HTML of that same page — including navigation, footer, scripts, and markup — might be 5,000-10,000 tokens. Most AI crawling systems strip boilerplate before processing, but the ratio of useful content tokens to total page tokens matters for processing efficiency.
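Those estimates follow from the common rule of thumb of roughly 0.75 English words per token (about 1.33 tokens per word). As a quick sketch, with the caveat that real ratios vary with vocabulary and writing style:

```python
# Rough token estimate for English prose, assuming the common
# rule of thumb of ~1.33 tokens per word. Actual counts vary.
TOKENS_PER_WORD = 4 / 3

def estimate_tokens(word_count):
    """Approximate token count for English text of the given word count."""
    return round(word_count * TOKENS_PER_WORD)

print(estimate_tokens(1500))  # → 2000
```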

Does tokenization affect SEO directly?

Not directly for traditional Google Search — the search index doesn't operate on language model tokens. But for AI-powered citation systems like Perplexity.ai, Google's AI Overviews, and ChatGPT with browsing, efficient tokenization means your content fits more completely into the model's working memory. When your 1,500-word article competes against a 5,000-word article for the same context window space, your content gets processed more completely and more efficiently. This translates to better representation in the model's hidden state, which influences citation probability.

Go deeper with practitioners

Join the Burstiness & Perplexity community for implementation support and weekly discussions.
