Why does the same text produce different token counts in different LLMs?

Each model ships its own tokenizer with its own vocabulary, learned from its own training corpus. GPT models use cl100k_base / o200k_base (tiktoken). Claude uses Anthropic's own BPE variant. DeepSeek trained a tokenizer optimized for Chinese plus code. Same string, different splits, different counts: in the measurements in this post, the spread across frontier models runs from about 15% on English prose to over 45% on Chinese.

Is Markdown always more token-efficient than HTML?

Almost always for human-readable content. Markdown uses single-character syntax (#, *, [, ]) that tokenizers handle in 1-2 tokens; HTML wraps the same content in opening and closing tags that consume 4-8 tokens each. For 5,000-word articles the difference runs 30-45%, measured across GPT-4o, Claude Opus 4.7, and DeepSeek R2.

Which tokenizer is most efficient for Chinese text?

DeepSeek's tokenizer is the most Chinese-efficient of the frontier models — about 1.0-1.1 tokens per Chinese character. Claude is 1.5-1.8. GPT-4o (o200k_base) is 1.3-1.5. Qwen and other China-focused models are similar to DeepSeek. For 10k-character Chinese articles, DeepSeek costs ~35% fewer tokens than Claude on top of being 30x cheaper per token.

Does Markdown's syntax overhead matter for short prompts?

Not much. For a 100-token prompt, Markdown syntax adds maybe 5-10 tokens. The savings compound at scale: when you paste a large research corpus, syntax efficiency is the difference between fitting 200 articles in context and fitting 280.

Should I strip Markdown formatting to save tokens?

No. Strip structure and you lose semantic signal. The model uses headings, code fences, and lists to interpret what each block means. Stripped plain text is only about 5% smaller in the 1,000-word measurements in this post, and it is interpreted measurably worse. The right optimization is using Markdown over HTML, not Markdown over nothing.

Can I see exact token counts before sending?

Yes. For OpenAI/GPT use `tiktoken` (Python) or the model's API counting endpoint. For Claude use Anthropic's `count_tokens` API. For DeepSeek use the local tokenizer they publish. Web2MD shows estimated token counts for both GPT and Claude in the conversion preview — handy for budgeting before pasting into chat.

Markdown Tokenization Deep Dive: Why GPT/Claude/DeepSeek Tokenize Markdown So Differently

If you've ever copied the same article into ChatGPT, Claude, and DeepSeek and noticed the token counts differ by 40%, this post explains why. More importantly, it explains how to lean into those differences when choosing which model to feed which content.

This is the mechanics. The earlier Markdown vs HTML for LLM post is the practical takeaway; this one is the inside view.

What a tokenizer actually does

Every LLM operates on tokens, not characters. The tokenizer is the deterministic mapping from input string to integer IDs.

The frontier tokenizers in 2026 are byte-pair encoding (BPE) variants. They start with a base vocabulary of individual bytes and learn merge rules from training data: "the byte sequence `the` appears often, so merge it into one token." Each merge adds one entry to the vocabulary, so repeating the step tens of thousands of times yields a 50k-200k token vocabulary.
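To make the merge loop concrete, here is a toy byte-level BPE in Python. It is a minimal sketch, not any vendor's tokenizer: the function names (`learn_merges`, `encode`, `merge_pair`) are made up for illustration, and production tokenizers add pre-tokenization rules and far larger training corpora.

```python
from collections import Counter

Token = bytes


def merge_pair(tokens: list[Token], left: Token, right: Token) -> list[Token]:
    """Replace each adjacent (left, right) pair with one fused token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_merges(corpus: str, num_merges: int) -> list[tuple[Token, Token]]:
    """Greedily learn merge rules: fuse the most frequent adjacent pair, repeat."""
    tokens = [bytes([b]) for b in corpus.encode("utf-8")]  # base vocabulary: single bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would save nothing
        merges.append((left, right))
        tokens = merge_pair(tokens, left, right)
    return merges


def encode(text: str, merges: list[tuple[Token, Token]]) -> list[Token]:
    """Tokenize new text by replaying the learned merges in order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        tokens = merge_pair(tokens, left, right)
    return tokens


merges = learn_merges("the the the the", num_merges=2)
print(merges)                 # the two learned rules: t+h, then th+e
print(encode("then", merges))  # "the" collapses into one token
print(encode("中", merges))    # no learned merge applies to these bytes
```

Trained on this tiny English corpus, the sketch packs `then` into two tokens but leaves the single character `中` as three separate bytes, which is the same effect, in miniature, that makes English-heavy tokenizers expensive on Chinese.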

Three things vary across models:

Vocabulary size. GPT-4's cl100k_base is ~100k tokens. GPT-4o's o200k_base is ~200k. Larger vocab → fewer tokens per string but more memory for the embedding table.
Training data distribution. A tokenizer trained on English + code packs English well but splits Chinese into multi-byte sequences. A tokenizer trained on Chinese-heavy data has dedicated tokens for common Chinese words.
Whitespace and punctuation handling. Some tokenizers fuse common punctuation with adjacent letters (one token for a word plus its trailing colon, versus a standalone `:` token). This affects Markdown costs noticeably.

Concrete numbers

Same 1,000-word English article:

| Tokenizer | Tokens (Markdown) | Tokens (HTML) | Tokens (plain) |
|---|---|---|---|
| GPT-4o (o200k_base) | 1,310 | 1,890 | 1,250 |
| GPT-4 (cl100k_base) | 1,420 | 2,100 | 1,360 |
| Claude Opus 4.7 | 1,480 | 2,210 | 1,420 |
| DeepSeek R2 | 1,290 | 1,850 | 1,230 |
| Gemini 2 | 1,380 | 2,050 | 1,310 |

Take-aways:

HTML penalty: 40-50% more tokens than Markdown across all models. Same content, much more token cost.
GPT-4o beats GPT-4: ~8% fewer tokens going from cl100k_base to o200k_base. The vocab upgrade is real.
DeepSeek slightly leaner: small per-token edge on English; the big win is on Chinese (next table).
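The percentages in these take-aways can be recomputed directly from the English table. The sketch below only restates the table's numbers; the helper names `pct_more` and `pct_fewer` are illustrative.

```python
# Token counts copied from the 1,000-word English table in this post.
COUNTS = {
    "GPT-4o (o200k_base)": {"markdown": 1310, "html": 1890, "plain": 1250},
    "GPT-4 (cl100k_base)": {"markdown": 1420, "html": 2100, "plain": 1360},
    "Claude Opus 4.7": {"markdown": 1480, "html": 2210, "plain": 1420},
    "DeepSeek R2": {"markdown": 1290, "html": 1850, "plain": 1230},
    "Gemini 2": {"markdown": 1380, "html": 2050, "plain": 1310},
}


def pct_more(a: int, b: int) -> float:
    """How many percent larger a is than b."""
    return 100 * (a - b) / b


def pct_fewer(a: int, b: int) -> float:
    """How many percent smaller a is than b."""
    return 100 * (b - a) / b


for name, c in COUNTS.items():
    print(
        f"{name}: HTML costs {pct_more(c['html'], c['markdown']):.1f}% more than Markdown; "
        f"Markdown costs {pct_more(c['markdown'], c['plain']):.1f}% more than plain text"
    )

gpt4o = COUNTS["GPT-4o (o200k_base)"]
gpt4 = COUNTS["GPT-4 (cl100k_base)"]
print(f"o200k_base vs cl100k_base: {pct_fewer(gpt4o['markdown'], gpt4['markdown']):.1f}% fewer tokens")
```

Note the baseline: HTML being 40-50% larger than Markdown is the same gap as Markdown being roughly 30-33% smaller than HTML, so "penalty" and "savings" figures for the same data will not match.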

Same 1,000-character Chinese article (mp.weixin.qq.com article body):

| Tokenizer | Tokens |
|---|---|
| DeepSeek R2 | 1,080 |
| Qwen 3 | 1,100 |
| GPT-4o (o200k_base) | 1,420 |
| Gemini 2 | 1,460 |
| Claude Opus 4.7 | 1,580 |

DeepSeek is ~32% more efficient than Claude on Chinese — purely from tokenizer choice, before you factor in the 30x lower per-token price.

Where Markdown wins and where it doesn't

Markdown saves tokens because its syntax is short and tokenizer-friendly. But not every Markdown construct is created equal.

Cheap Markdown (use freely)

# headings: 1-2 tokens for ## Section regardless of model
* and _ for bold/italic: 1 token per delimiter
Single backticks for code spans and triple-backtick code fences: 1 token per delimiter
Numbered/bulleted list markers: 1-2 tokens each
Inline links [text](url): usually 4-6 tokens for the brackets + parens

Expensive Markdown (watch out)

Tables: each | and dash row consumes tokens. A 10x5 table can run 100+ tokens of pure delimiters.
Inline HTML in code spans: wrapping a snippet like `<div>` in backticks doesn't help if the HTML inside is verbose
Excessive nesting: deeply indented lists (4+ levels) start consuming tokens for whitespace runs
Long URLs in inline links: a 100-character query string is ~30 tokens
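A quick way to see how much of a table is pure syntax is to count its delimiter characters. The sketch below is a rough proxy only: it counts characters rather than tokens, ignores escaped pipes inside cells, and `table_delimiter_chars` is an illustrative name, not part of any library.

```python
def table_delimiter_chars(table: str) -> int:
    """Count characters that are table syntax rather than cell content."""
    total = 0
    for line in table.strip().splitlines():
        stripped = line.strip()
        if set(stripped) <= set("|-: "):
            # Separator row such as |---|---|: every non-space character is syntax.
            total += sum(1 for ch in stripped if ch != " ")
        else:
            # Content row: only the pipes are syntax.
            total += stripped.count("|")
    return total


# Build a table with a header, a separator row, and 10 data rows of 5 columns.
header = "| " + " | ".join(f"h{c}" for c in range(5)) + " |"
separator = "|" + "---|" * 5
rows = ["| " + " | ".join(f"r{r}c{c}" for c in range(5)) + " |" for r in range(10)]
table = "\n".join([header, separator, *rows])

print(table_delimiter_chars(table), "delimiter characters out of", len(table))
```

For real budgeting, run the table through the target model's tokenizer; delimiters often fuse with neighbouring spaces, so the token count will differ from the character count.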

Costly HTML constructs that survive bad clipping

When you paste content from a generic clipper, you often get residual HTML:

<span class="hljs-keyword"> style residue: every span tag is 4-6 tokens; a code block with syntax-highlighting residue can double in token count
<table><tr><td> style tables vs Markdown tables: 3-4x more tokens
Inline style="...": each style attribute is 5-15 tokens of pure noise

Web2MD's code block / table / LaTeX preprocessing explicitly removes these — that's where most of the 30-40% token savings come from.
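As an illustration of the kind of cleanup involved, the sketch below strips `<span>` wrappers and inline `style` attributes with regular expressions. It is an assumption-laden example, not Web2MD's actual implementation: `strip_clipper_residue` is a hypothetical helper, and a real pipeline would more likely use an HTML parser than regexes.

```python
import re

# Opening or closing <span ...> tags, e.g. syntax-highlighter wrappers.
SPAN_TAG = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)
# Inline style="..." or style='...' attributes, including the leading whitespace.
STYLE_ATTR = re.compile(r"""\s+style\s*=\s*("[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_clipper_residue(text: str) -> str:
    """Remove span wrappers and inline style attributes, keeping the inner text."""
    text = SPAN_TAG.sub("", text)
    text = STYLE_ATTR.sub("", text)
    return text


dirty = '<span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():'
print(strip_clipper_residue(dirty))
print(strip_clipper_residue('<td style="color: red">x</td>'))
```

The regex for `style` will also match the literal text ` style="..."` in prose or code samples, so treat this as a starting point to adapt rather than a safe general-purpose cleaner.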

The Chinese tokenization gap

For workflows over Chinese sources, the tokenizer gap dominates. A 50-article Chinese research corpus:

DeepSeek R2 at 1.1 tokens/char → ~165k tokens, ~$0.083 input cost (50 articles × ~3,000 chars = 150k chars)
Claude Opus 4.7 at 1.7 tokens/char → ~255k tokens, ~$3.83 input cost

That's a roughly 46x cost difference before any reasoning quality comparison. For Chinese-source workflows, DeepSeek is the default not because the model is better but because the tokenizer + pricing combination makes it economically viable.
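The corpus arithmetic is easy to rerun with your own numbers. In the sketch below, the per-million-token prices ($0.50 and $15.00) are assumptions back-solved from the dollar figures and the 30x per-token gap quoted in this post, not quoted list prices; check current pricing before relying on them.

```python
def input_cost(n_chars: int, tokens_per_char: float, usd_per_million: float) -> tuple[float, float]:
    """Return (estimated tokens, estimated input cost in USD)."""
    tokens = n_chars * tokens_per_char
    return tokens, tokens * usd_per_million / 1_000_000


corpus_chars = 50 * 3_000  # 50 articles x ~3,000 characters each

# Assumed prices in USD per million input tokens (back-solved, not quoted).
deepseek_tokens, deepseek_usd = input_cost(corpus_chars, 1.1, 0.50)
claude_tokens, claude_usd = input_cost(corpus_chars, 1.7, 15.00)

print(f"DeepSeek: {deepseek_tokens:,.0f} tokens, ${deepseek_usd:.3f}")
print(f"Claude:   {claude_tokens:,.0f} tokens, ${claude_usd:.2f}")
print(f"Cost ratio: {claude_usd / deepseek_usd:.0f}x")
```

The ratio is the product of two independent factors, the tokens-per-character gap and the price-per-token gap, so either one can be swapped out when comparing other models.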

See DeepSeek R2 + Chinese web content pipeline for the full workflow.

Practical token budgeting

Before pasting content into an LLM:

Estimate from length. English: ~4 chars/token. Chinese: 1.0-1.8 tokens/char depending on model (note the flipped unit: a Chinese character usually costs at least one token). Code: ~3 chars/token (denser). Quick gut math.
Use the official tokenizer. For accuracy when it matters:
- GPT: `pip install tiktoken`, then `len(tiktoken.encoding_for_model("gpt-4o").encode(text))`
- Claude: the Anthropic API's `messages.count_tokens()`
- DeepSeek: their published tokenizer JSON
Browser extensions show estimates. Web2MD's preview shows both GPT-4 and Claude token estimates before you paste. The Claude estimate uses a heuristic (not the official tokenizer), so it's ±5%, but that is enough for budgeting decisions like "will this fit in my 200k Pro context window?"
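The length-based estimate can be sketched in a few lines of Python. This is a heuristic under stated assumptions, not an official tokenizer: the default of 1.5 tokens per Chinese character is a midpoint picked for illustration, the CJK range covers only the basic ideograph block, and `estimate_tokens` and `count_tokens` are illustrative names.

```python
import re
from typing import Callable, Sequence

# CJK Unified Ideographs, basic block only (an approximation of "Chinese text").
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def estimate_tokens(
    text: str,
    chars_per_token: float = 4.0,      # ~4 for English prose, ~3 for code
    cjk_tokens_per_char: float = 1.5,  # assumed midpoint; varies by model
) -> int:
    """Rough token estimate from character counts alone."""
    cjk = len(CJK_CHAR.findall(text))
    other = len(text) - cjk
    return round(other / chars_per_token + cjk * cjk_tokens_per_char)


def count_tokens(text: str, encode: Callable[[str], Sequence[int]]) -> int:
    """Exact count using any tokenizer's encode function."""
    return len(encode(text))


print(estimate_tokens("Markdown is cheaper than HTML for most prompts."))
print(estimate_tokens("分词器决定了成本", cjk_tokens_per_char=1.1))

# With tiktoken installed, the exact GPT-4o count would look like:
#   import tiktoken
#   enc = tiktoken.encoding_for_model("gpt-4o")
#   count_tokens(text, enc.encode)
```

Passing the encoder in as a function keeps the budgeting code independent of any one vendor, so the same call works with whichever tokenizer matches the model you are targeting.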

When tokenization choice should drive model choice

For Chinese content: DeepSeek is the default unless reasoning quality demands Claude. The tokenizer + price combo is unbeatable.

For English long-form research: GPT-4o and Claude are essentially tied on tokens. Pick by reasoning quality and pricing.

For code-heavy content: similar across models — code dominates and all major tokenizers handle it well.

For mixed multilingual (English + Chinese): DeepSeek edges out, but if your reasoning is mostly in English, Claude's English-language reasoning quality may justify the token premium.

The takeaway

Markdown over HTML saves 30-45% across all major LLMs in 2026. That's the floor. The ceiling — choosing the tokenizer-matched model — adds another 30-50% for Chinese workflows.

You don't need to memorize tokenization internals. You do need to:

Always feed Markdown, never HTML.
For Chinese-heavy work, default to DeepSeek unless you've measured otherwise.
Use a clipper that strips highlight/style residue (otherwise your "Markdown" is 30% noise).

The cost compounds across every research session. Get the inputs right once and the workflow stays cheap forever.

Install

Web2MD on the Chrome Web Store →

Free tier: 3 conversions/day. Pro at $9/mo unlocks unlimited + queue + token estimates + REST/MCP API.
