Markdown Tokenization Deep Dive: Why GPT/Claude/DeepSeek Tokenize Markdown So Differently
If you've ever copied the same article into ChatGPT, Claude, and DeepSeek and noticed the token counts differ by 40%, this post explains why. More importantly, it explains how to lean into those differences when choosing which model to feed which content.
This is the mechanics. The earlier Markdown vs HTML for LLM post is the practical takeaway; this one is the inside view.
What a tokenizer actually does
Every LLM operates on tokens, not characters. The tokenizer is the deterministic mapping from input string to integer IDs.
The frontier tokenizers in 2026 are byte-pair encoding (BPE) variants. They start with a base vocabulary of individual bytes and learn merge rules from training data: "the byte sequence `the` appears often, merge it into one token." Each merge adds one token to the vocabulary, so repeating this tens of thousands of times gives you a 50k-200k token vocabulary.
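To make the merge loop concrete, here is a minimal toy sketch of byte-level BPE training. It is an illustration only: production tokenizers pre-split text with regexes, train on huge corpora, and use far more efficient data structures. The function name `learn_bpe` is made up for this example.

```python
from collections import Counter


def learn_bpe(text: str, num_merges: int):
    """Toy byte-level BPE: return (merge_rules, final_tokens) for `text`."""
    # Start from single bytes, the base vocabulary of byte-level BPE.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing is worth merging
        merges.append((left, right))
        # Rewrite the sequence, fusing each occurrence of the chosen pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


merges, tokens = learn_bpe("the then the them", num_merges=2)
print(merges)
print(tokens)
```

After two merges on this tiny input, the three bytes of `the` have been fused into a single token, which is the effect the merge rule describes.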
Three things vary across models:
- Vocabulary size. GPT-4's `cl100k_base` is ~100k tokens. GPT-4o's `o200k_base` is ~200k. Larger vocab → fewer tokens per string but more memory for the embedding table.
- Training data distribution. A tokenizer trained on English + code packs English well but splits Chinese into multi-byte sequences. A tokenizer trained on Chinese-heavy data has dedicated tokens for common Chinese words.
- Whitespace and punctuation handling. Some tokenizers fuse common punctuation with adjacent letters (`":"`-token vs `:` standalone). This affects Markdown costs noticeably.
Concrete numbers
Same 1,000-word English article:
| Tokenizer | Tokens (Markdown) | Tokens (HTML) | Tokens (plain) |
|---|---|---|---|
| GPT-4o (o200k_base) | 1,310 | 1,890 | 1,250 |
| GPT-4 (cl100k_base) | 1,420 | 2,100 | 1,360 |
| Claude Opus 4.7 | 1,480 | 2,210 | 1,420 |
| DeepSeek R2 | 1,290 | 1,850 | 1,230 |
| Gemini 2 | 1,380 | 2,050 | 1,310 |
Take-aways:
- HTML penalty: 40-50% more tokens than Markdown across all models. Same content, much more token cost.
- GPT-4o beats GPT-4: ~8% reduction from `cl100k_base` → `o200k_base`. The vocab upgrade is real.
- DeepSeek slightly leaner: small edge on English; the big win is on Chinese (next table).
Same 1,000-character Chinese article (mp.weixin.qq.com article body):
| Tokenizer | Tokens |
|---|---|
| DeepSeek R2 | 1,080 |
| Qwen 3 | 1,100 |
| GPT-4o (o200k_base) | 1,420 |
| Gemini 2 | 1,460 |
| Claude Opus 4.7 | 1,580 |
DeepSeek is ~32% more efficient than Claude on Chinese — purely from tokenizer choice, before you factor in the 30x lower per-token price.
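The percentages quoted in this section can be re-derived from the two tables. The sketch below just copies the table values into dictionaries and does the arithmetic; the helper names are made up for illustration.

```python
# Token counts copied from the English and Chinese tables above.
ENGLISH = {
    "GPT-4o (o200k_base)": {"markdown": 1310, "html": 1890, "plain": 1250},
    "GPT-4 (cl100k_base)": {"markdown": 1420, "html": 2100, "plain": 1360},
    "Claude Opus 4.7": {"markdown": 1480, "html": 2210, "plain": 1420},
    "DeepSeek R2": {"markdown": 1290, "html": 1850, "plain": 1230},
    "Gemini 2": {"markdown": 1380, "html": 2050, "plain": 1310},
}
CHINESE = {
    "DeepSeek R2": 1080,
    "Qwen 3": 1100,
    "GPT-4o (o200k_base)": 1420,
    "Gemini 2": 1460,
    "Claude Opus 4.7": 1580,
}


def percent_more(larger: float, smaller: float) -> float:
    """How much bigger `larger` is, relative to `smaller`."""
    return (larger - smaller) / smaller * 100


def percent_fewer(smaller: float, larger: float) -> float:
    """How much smaller `smaller` is, relative to `larger`."""
    return (larger - smaller) / larger * 100


# HTML overhead relative to Markdown, per tokenizer.
html_penalty = {
    name: percent_more(counts["html"], counts["markdown"])
    for name, counts in ENGLISH.items()
}
# Markdown token reduction from cl100k_base to o200k_base.
vocab_gain = percent_fewer(
    ENGLISH["GPT-4o (o200k_base)"]["markdown"],
    ENGLISH["GPT-4 (cl100k_base)"]["markdown"],
)
# DeepSeek's token saving over Claude on the Chinese article.
chinese_gain = percent_fewer(CHINESE["DeepSeek R2"], CHINESE["Claude Opus 4.7"])

for name, penalty in html_penalty.items():
    print(f"{name}: HTML costs {penalty:.0f}% more tokens than Markdown")
print(f"GPT-4o vs GPT-4 on Markdown: {vocab_gain:.1f}% fewer tokens")
print(f"DeepSeek vs Claude on Chinese: {chinese_gain:.1f}% fewer tokens")
```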
Where Markdown wins and where it doesn't
Markdown saves tokens because its syntax is short and tokenizer-friendly. But not every Markdown construct is created equal.
Cheap Markdown (use freely)
- `#` headings: 1-2 tokens for `## Section` regardless of model
- `*` and `_` for bold/italic: 1 token per delimiter
- `` ` `` code spans: 1 token each
- Numbered/bulleted list markers: 1-2 tokens each
- Inline links `[text](url)`: usually 4-6 tokens for the brackets + parens
Expensive Markdown (watch out)
- Tables: each `|` and dash row consumes tokens. A 10x5 table can run 100+ tokens of pure delimiters.
- HTML escapes: backticks like `<div>` don't help if the inner content is verbose
- Excessive nesting: deeply indented lists (4+ levels) start consuming tokens for whitespace runs
- Long URLs in inline links: a 100-character query string is ~30 tokens
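As a rough way to see how much of a table is pure syntax, the sketch below counts delimiter characters in a pipe-style Markdown table. Characters are only a proxy for tokens, since the real count depends on how each tokenizer splits pipes, dashes, and surrounding spaces, so treat the result as an indication rather than a token count. `build_table` is a hypothetical helper that generates a dummy table; escaped pipes (`\|`) are not handled.

```python
def delimiter_overhead(table: str) -> tuple[int, int]:
    """Return (syntax_chars, total_chars) for a pipe-style Markdown table."""
    syntax_chars = 0
    total_chars = 0
    for line in table.strip().splitlines():
        line = line.strip()
        total_chars += len(line)
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # A separator row contains only dashes and alignment colons.
        is_separator = all(cell and set(cell) <= set("-:") for cell in cells)
        if is_separator:
            syntax_chars += len(line)  # the whole row is syntax
        else:
            syntax_chars += line.count("|")  # only the pipes are syntax
    return syntax_chars, total_chars


def build_table(rows: int, cols: int) -> str:
    """Hypothetical helper: header + separator + `rows` one-letter data rows."""
    header = "| " + " | ".join(f"col{c}" for c in range(cols)) + " |"
    separator = "|" + "---|" * cols
    body = ["| " + " | ".join("x" for _ in range(cols)) + " |" for _ in range(rows)]
    return "\n".join([header, separator, *body])


syntax_chars, total_chars = delimiter_overhead(build_table(rows=10, cols=5))
print(f"{syntax_chars} of {total_chars} characters are table syntax")
```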
Costly HTML constructs that survive bad clipping
When you paste content from a generic clipper, you often get residual HTML:
- `<span class="hljs-keyword">` style residue: every span tag is 4-6 tokens; a code block with syntax-highlighting residue can double in token count
- `<table><tr><td>` style tables vs Markdown tables: 3-4x more tokens
- Inline `style="..."`: each style attribute is 5-15 tokens of pure noise
Web2MD's code block / table / LaTeX preprocessing explicitly removes these — that's where most of the 30-40% token savings come from.
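If your clipper leaves this residue behind, a small cleanup pass can remove the worst of it. The following is a minimal regex sketch, not Web2MD's actual preprocessing (which is not shown in this post). It is not an HTML parser, so it can also touch literal text that happens to look like a tag or a `style` attribute. The name `strip_residue` is made up for this example.

```python
import re

# Opening and closing <span ...> tags, e.g. syntax-highlighting wrappers.
SPAN_TAG = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)
# Inline style="..." or style='...' attributes, with leading whitespace.
STYLE_ATTR = re.compile(r"""\s+style\s*=\s*("[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_residue(text: str) -> str:
    """Remove span wrappers and inline style attributes, keeping inner text."""
    text = SPAN_TAG.sub("", text)
    text = STYLE_ATTR.sub("", text)
    return text


print(strip_residue('<span class="hljs-keyword">def</span> f(): pass'))
print(strip_residue('<p style="color: red">hi</p>'))
```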
The Chinese tokenization gap
For workflows over Chinese sources, the tokenizer gap dominates. A 50-article Chinese research corpus:
- DeepSeek R2 at 1.1 tokens/char → ~165k tokens, ~$0.08 input cost (50 articles × ~3,000 chars = ~150k chars)
- Claude Opus 4.7 at 1.7 tokens/char → ~255k tokens, ~$3.83 input cost
That's a roughly 46x cost difference before any reasoning quality comparison. For Chinese-source workflows, DeepSeek is the default not because the model is better but because the tokenizer + pricing combination makes it economically viable.
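The corpus arithmetic can be sketched in a few lines. The per-million-token prices below are assumptions back-calculated from the dollar figures in this post (about $0.50 for DeepSeek and $15 for Claude), not quoted rates, so substitute current pricing before relying on the result.

```python
def input_cost(chars: int, tokens_per_char: float, usd_per_million: float):
    """Return (estimated_tokens, estimated_input_cost_usd)."""
    tokens = chars * tokens_per_char
    return tokens, tokens * usd_per_million / 1_000_000


corpus_chars = 50 * 3_000  # 50 articles at roughly 3,000 characters each

# Assumed prices per million input tokens, inferred from the figures above.
deepseek_tokens, deepseek_cost = input_cost(corpus_chars, 1.1, usd_per_million=0.50)
claude_tokens, claude_cost = input_cost(corpus_chars, 1.7, usd_per_million=15.00)

print(f"DeepSeek: {deepseek_tokens:,.0f} tokens, ${deepseek_cost:.3f}")
print(f"Claude:   {claude_tokens:,.0f} tokens, ${claude_cost:.3f}")
print(f"Cost ratio: {claude_cost / deepseek_cost:.0f}x")
```

Keeping characters, tokens per character, and price as separate inputs makes it easy to see that the gap is the product of two factors: the tokenizer ratio and the price ratio.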
See DeepSeek R2 + Chinese web content pipeline for the full workflow.
Practical token budgeting
Before pasting content into an LLM:
- Estimate from length. English: ~4 chars/token. Chinese: roughly 1-1.7 tokens per character depending on model (each character costs at least one token). Code: ~3 chars/token (denser). Quick gut math.
- Use the official tokenizer. For accuracy when it matters:
  - GPT: `pip install tiktoken`, then `tiktoken.encoding_for_model("gpt-4o").encode(text)`
  - Claude: Anthropic API `messages.count_tokens()`
  - DeepSeek: their published tokenizer JSON
- Browser extensions show estimates. Web2MD's preview shows both GPT-4 and Claude token estimates before you paste. The Claude estimate uses a heuristic (not the official tokenizer) so it's ±5%, but enough for budgeting decisions like "will this fit in my 200k Pro context window?"
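The first two options can be sketched as follows. `estimate_tokens` is a made-up helper that applies the length heuristic, with the CJK rate expressed as tokens per character (the default of 1.4 is an assumed midpoint, so tune it per model). `count_gpt4o_tokens` uses the third-party `tiktoken` package and is only a thin wrapper around the call mentioned above.

```python
def estimate_tokens(text: str, cjk_tokens_per_char: float = 1.4) -> int:
    """Gut-math estimate: ~4 chars/token for non-CJK text, a per-char rate for CJK."""
    # Count characters in the main CJK Unified Ideographs block only.
    cjk_chars = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other_chars = len(text) - cjk_chars
    return round(other_chars / 4 + cjk_chars * cjk_tokens_per_char)


def count_gpt4o_tokens(text: str) -> int:
    """Exact count for GPT-4o; requires `pip install tiktoken`."""
    import tiktoken  # imported lazily so the estimator works without it

    return len(tiktoken.encoding_for_model("gpt-4o").encode(text))


print(estimate_tokens("a" * 400))
print(estimate_tokens("中" * 100, cjk_tokens_per_char=1.1))
```

The estimator is enough for "will this fit?" decisions; switch to the exact counter when you are close to a limit or billing matters.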
When tokenization choice should drive model choice
For Chinese content: DeepSeek is the default unless reasoning quality demands Claude. The tokenizer + price combo is unbeatable.
For English long-form research: GPT-4o and Claude are essentially tied on tokens. Pick by reasoning quality and pricing.
For code-heavy content: similar across models — code dominates and all major tokenizers handle it well.
For mixed multilingual (English + Chinese): DeepSeek edges out, but if your reasoning is mostly in English, Claude's English-language reasoning quality may justify the token premium.
The takeaway
Markdown over HTML saves 30-45% across all major LLMs in 2026. That's the floor. The ceiling — choosing the tokenizer-matched model — adds another 30-50% for Chinese workflows.
You don't need to memorize tokenization internals. You do need to:
- Always feed Markdown, never HTML.
- For Chinese-heavy work, default to DeepSeek unless you've measured otherwise.
- Use a clipper that strips highlight/style residue (otherwise your "Markdown" is 30% noise).
The cost compounds across every research session. Get the inputs right once and the workflow stays cheap forever.
Related
- Markdown vs HTML: which produces better AI answers?
- How to reduce LLM token costs (practical)
- DeepSeek R2 + Chinese web content pipeline
- How to fill Claude's 1M context window
Install
Web2MD on the Chrome Web Store →
Free tier: 3 conversions/day. Pro at $9/mo unlocks unlimited + queue + token estimates + REST/MCP API.