Markdown Tokenization Deep Dive: Why GPT/Claude/DeepSeek Tokenize Markdown So Differently

Zephyr Whimsy · 2026-06-04 · 6 min read

If you've ever copied the same article into ChatGPT, Claude, and DeepSeek and noticed the token counts differ by 40%, this post explains why. More importantly, it explains how to lean into those differences when choosing which model to feed which content.

This post covers the mechanics. The earlier Markdown vs HTML for LLM post gives the practical takeaway; this one is the inside view.

What a tokenizer actually does

Every LLM operates on tokens, not characters. The tokenizer is the deterministic mapping from input string to integer IDs.

The frontier tokenizers in 2026 are byte-pair encoding (BPE) variants. They start with a base vocabulary of individual bytes and learn merge rules from training data: "the byte sequence `the` appears often, so merge it into one token." Repeat this tens of thousands of times and you get a vocabulary of 50k-200k tokens.
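To make the merge loop concrete, here is a minimal toy sketch of BPE training. It is an illustration under simplifying assumptions, not how any production tokenizer is implemented: it works on characters rather than bytes, uses a five-word corpus, and the function name `learn_bpe_merges` is hypothetical.

```python
from collections import Counter


def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merge rules from a list of words (toy BPE)."""
    # Start from single characters; real tokenizers start from raw bytes.
    words = Counter(tuple(word) for word in corpus)
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        rewritten: Counter = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            rewritten[tuple(out)] += freq
        words = rewritten
    return merges


merges = learn_bpe_merges(["the", "the", "then", "other", "cat"], num_merges=2)
print(merges)  # the second merge fuses the pieces of "the" into one symbol
```

After two merges, the frequent sequence `the` has become a single symbol, while the rare word `cat` is still three separate characters. That asymmetry is why text resembling the training data tokenizes cheaply.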

Three things vary across models:

  1. Vocabulary size. GPT-4's cl100k_base is ~100k tokens. GPT-4o's o200k_base is ~200k. Larger vocab → fewer tokens per string but more memory for the embedding table.
  2. Training data distribution. A tokenizer trained on English + code packs English well but splits Chinese into multi-byte sequences. A tokenizer trained on Chinese-heavy data has dedicated tokens for common Chinese words.
  3. Whitespace and punctuation handling. Some tokenizers fuse common punctuation with adjacent characters into a single token, while others always emit a standalone : token. This noticeably affects Markdown costs.

Concrete numbers

Same 1,000-word English article:

| Tokenizer | Tokens (Markdown) | Tokens (HTML) | Tokens (plain) |
|---|---|---|---|
| GPT-4o (o200k_base) | 1,310 | 1,890 | 1,250 |
| GPT-4 (cl100k_base) | 1,420 | 2,100 | 1,360 |
| Claude Opus 4.7 | 1,480 | 2,210 | 1,420 |
| DeepSeek R2 | 1,290 | 1,850 | 1,230 |
| Gemini 2 | 1,380 | 2,050 | 1,310 |

Take-aways:

  • HTML penalty: 40-50% more tokens than Markdown across all models. Same content, much more token cost.
  • GPT-4o > GPT-4: ~8% reduction from cl100k_base → o200k_base. The vocab upgrade is real.
  • DeepSeek slightly leaner: small per-token edge on English; the big win is on Chinese (next table).
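The percentages in these take-aways can be recomputed directly from the English table. A minimal sketch using only the numbers above; the dictionary layout and function names are illustrative:

```python
# Token counts copied from the 1,000-word English table above.
counts = {
    "GPT-4o (o200k_base)": {"markdown": 1310, "html": 1890, "plain": 1250},
    "GPT-4 (cl100k_base)": {"markdown": 1420, "html": 2100, "plain": 1360},
    "Claude Opus 4.7": {"markdown": 1480, "html": 2210, "plain": 1420},
    "DeepSeek R2": {"markdown": 1290, "html": 1850, "plain": 1230},
    "Gemini 2": {"markdown": 1380, "html": 2050, "plain": 1310},
}


def html_penalty(row: dict) -> float:
    """Percent more tokens HTML costs relative to Markdown."""
    return (row["html"] / row["markdown"] - 1) * 100


def markdown_saving(row: dict) -> float:
    """Percent of tokens saved by using Markdown instead of HTML."""
    return (1 - row["markdown"] / row["html"]) * 100


for name, row in counts.items():
    print(f"{name}: HTML +{html_penalty(row):.0f}%, "
          f"Markdown saves {markdown_saving(row):.0f}%")

# Reduction from cl100k_base to o200k_base on the Markdown column.
vocab_gain = (1 - counts["GPT-4o (o200k_base)"]["markdown"]
              / counts["GPT-4 (cl100k_base)"]["markdown"]) * 100
print(f"o200k_base vs cl100k_base: {vocab_gain:.1f}% fewer tokens")
```

Note that the same gap can be stated two ways: for GPT-4o, HTML costing about 44% more than Markdown is equivalent to Markdown saving about 31% relative to HTML.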

Same 1,000-character Chinese article (mp.weixin.qq.com article body):

| Tokenizer | Tokens |
|---|---|
| DeepSeek R2 | 1,080 |
| Qwen 3 | 1,100 |
| GPT-4o (o200k_base) | 1,420 |
| Gemini 2 | 1,460 |
| Claude Opus 4.7 | 1,580 |

DeepSeek uses ~32% fewer tokens than Claude on Chinese (1,080 vs 1,580), purely from tokenizer choice, before you factor in the 30x lower per-token price.

Where Markdown wins and where it doesn't

Markdown saves tokens because its syntax is short and tokenizer-friendly. But not every Markdown construct is created equal.

Cheap Markdown (use freely)

  • # headings: 1-2 tokens for ## Section regardless of model
  • * and _ for bold/italic: 1 token per delimiter
  • ` and ``` code delimiters: 1 token each
  • Numbered/bulleted list markers: 1-2 tokens each
  • Inline links [text](url): usually 4-6 tokens for the brackets + parens

Expensive Markdown (watch out)

  • Tables: each | and dash row consumes tokens. A 10x5 table can run 100+ tokens of pure delimiters.
  • Escaped HTML: wrapping markup such as `<div>` in backticks doesn't help if the inner content is verbose
  • Excessive nesting: deeply indented lists (4+ levels) start consuming tokens for whitespace runs
  • Long URLs in inline links: a 100-character query string is ~30 tokens
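To get a feel for the table overhead, the sketch below builds a 10x5 Markdown table and counts its structural pieces. This is a character-level proxy, not a token count: how many tokens each pipe or dash run costs depends on the tokenizer, and the helper names are hypothetical.

```python
import re


def markdown_table(rows: int, cols: int) -> str:
    """Build a Markdown table with a header, a separator, and `rows` data rows."""
    header = "| " + " | ".join(f"h{c}" for c in range(cols)) + " |"
    separator = "|" + "|".join("---" for _ in range(cols)) + "|"
    body = [
        "| " + " | ".join(f"r{r}c{c}" for c in range(cols)) + " |"
        for r in range(rows)
    ]
    return "\n".join([header, separator, *body])


def delimiter_stats(table: str) -> dict:
    """Count lines, pipe characters, and dash runs in the separator row."""
    lines = table.splitlines()
    return {
        "lines": len(lines),
        "pipes": table.count("|"),
        "dash_runs": len(re.findall(r"-{3,}", lines[1])),
    }


stats = delimiter_stats(markdown_table(rows=10, cols=5))
print(stats)  # {'lines': 12, 'pipes': 72, 'dash_runs': 5}
```

That is 77 structural pieces before counting the spaces around each pipe and the newlines, so a budget of 100+ delimiter tokens for a table this size is plausible if most pieces cost a token each.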

Costly HTML constructs that survive bad clipping

When you paste content from a generic clipper, you often get residual HTML:

  • <span class="hljs-keyword"> style residue: every span tag is 4-6 tokens; a code block with syntax-highlighting residue can double in token count
  • <table><tr><td> style tables vs Markdown tables: 3-4x more tokens
  • Inline style="...": each style attribute is 5-15 tokens of pure noise

Web2MD's code block / table / LaTeX preprocessing explicitly removes these — that's where most of the 30-40% token savings come from.
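As an illustration of the kind of cleanup involved, here is a minimal regex-based sketch that drops span tags and inline style attributes. It is not Web2MD's actual implementation; the function name `strip_residue` is hypothetical, and a real clipper would more likely use an HTML parser, since regexes are fragile on malformed markup.

```python
import re

# Opening and closing <span> tags, with any attributes.
SPAN_TAG = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)
# Inline style="..." or style='...' attributes, including leading whitespace.
STYLE_ATTR = re.compile(r"""\s+style\s*=\s*("[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_residue(text: str) -> str:
    """Remove span wrappers and inline style attributes, keeping inner text."""
    text = SPAN_TAG.sub("", text)
    return STYLE_ATTR.sub("", text)


sample = (
    '<p style="color: red; margin: 0">Use '
    '<span class="hljs-keyword">def</span> here</p>'
)
cleaned = strip_residue(sample)
print(cleaned)  # <p>Use def here</p>
```

The sample shrinks from 83 characters to 19 while keeping all of the readable text, which is the effect the residue bullets above describe.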

The Chinese tokenization gap

For workflows over Chinese sources, the tokenizer gap dominates. A 50-article Chinese research corpus:

  • DeepSeek R2 at 1.1 tokens/char → ~165k tokens, ~$0.08 input cost (50 articles × ~3,000 chars = 150k chars)
  • Claude Opus 4.7 at 1.7 tokens/char → ~255k tokens, ~$3.83 input cost

That's a roughly 45x cost difference before any reasoning quality comparison: about 30x from price and about 1.5x from the tokenizer. For Chinese-source workflows, DeepSeek is the default not because the model is better but because the tokenizer + pricing combination makes it economically viable.
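The corpus arithmetic can be recomputed from the per-character rates. In the sketch below, the tokens/char values come from the bullets above, while the per-million-token prices ($0.50 and $15) are assumptions back-derived from the post's dollar figures, not quoted from any price list.

```python
ARTICLES = 50
CHARS_PER_ARTICLE = 3_000

# tokens_per_char: from the bullets above.
# usd_per_million: ASSUMED input prices implied by the post's cost figures.
MODELS = {
    "DeepSeek R2": {"tokens_per_char": 1.1, "usd_per_million": 0.50},
    "Claude Opus 4.7": {"tokens_per_char": 1.7, "usd_per_million": 15.00},
}


def input_cost(tokens_per_char: float, usd_per_million: float) -> tuple[float, float]:
    """Return (total input tokens, input cost in USD) for the whole corpus."""
    chars = ARTICLES * CHARS_PER_ARTICLE
    tokens = chars * tokens_per_char
    return tokens, tokens * usd_per_million / 1_000_000


results = {name: input_cost(**params) for name, params in MODELS.items()}
for name, (tokens, cost) in results.items():
    print(f"{name}: {tokens:,.0f} tokens, ${cost:.4f}")

ratio = results["Claude Opus 4.7"][1] / results["DeepSeek R2"][1]
print(f"Cost ratio: {ratio:.0f}x")
```

Under these assumptions, 150k characters come to about 165k tokens for DeepSeek and 255k for Claude, and the cost ratio is about 46x. Most of the gap comes from price; the tokenizer contributes the remaining factor of roughly 1.5x.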

See DeepSeek R2 + Chinese web content pipeline for the full workflow.

Practical token budgeting

Before pasting content into an LLM:

  1. Estimate from length. English: ~4 chars/token. Chinese: roughly 1.1-1.6 tokens per character depending on model (see the Chinese table above). Code: ~3 chars/token (more tokens per character than prose). Quick gut math.
  2. Use the official tokenizer. For accuracy when it matters:
    • GPT: pip install tiktoken, then len(tiktoken.encoding_for_model("gpt-4o").encode(text))
    • Claude: Anthropic API messages.count_tokens()
    • DeepSeek: their published tokenizer JSON
  3. Browser extensions show estimates. Web2MD's preview shows both GPT-4 and Claude token estimates before you paste. The Claude estimate uses a heuristic (not the official tokenizer) so it's ±5%, but enough for budgeting decisions like "will this fit in my 200k Pro context window?"
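The first two steps can be sketched as a small helper. The English and code rates are the rules of thumb from step 1; the Chinese rate of 1.6 tokens per character (0.625 chars/token) is an assumption taken from the upper end of the Chinese table earlier in the post. The function names and the `reserve` parameter are hypothetical. The exact count uses `tiktoken`, which is imported lazily so the heuristic works without it, and which may need to fetch its vocabulary file on first use.

```python
import math

# Approximate characters per token. The Chinese value is an assumed
# conservative rate (1.6 tokens/char) based on the table above.
CHARS_PER_TOKEN = {"english": 4.0, "code": 3.0, "chinese": 0.625}


def estimate_tokens(text: str, kind: str) -> int:
    """Heuristic token estimate from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind])


def fits(text: str, kind: str, context_window: int = 200_000,
         reserve: int = 4_000) -> bool:
    """True if the estimate leaves `reserve` tokens for the prompt and reply."""
    return estimate_tokens(text, kind) + reserve <= context_window


def exact_gpt_tokens(text: str, model: str = "gpt-4o") -> int:
    """Exact count for OpenAI models; requires `pip install tiktoken`."""
    import tiktoken  # imported lazily so the heuristic has no dependencies

    return len(tiktoken.encoding_for_model(model).encode(text))


article = "word " * 800  # 4,000 characters of placeholder English text
print(estimate_tokens(article, "english"))  # 1000
print(fits(article, "english"))             # True
```

The heuristic is only for quick go/no-go decisions. When a document is close to the limit, use the exact count from the official tokenizer instead.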

When tokenization choice should drive model choice

For Chinese content: DeepSeek is the default unless reasoning quality demands Claude. The tokenizer + price combo is unbeatable.

For English long-form research: GPT-4o and Claude are essentially tied on tokens. Pick by reasoning quality and pricing.

For code-heavy content: similar across models — code dominates and all major tokenizers handle it well.

For mixed multilingual (English + Chinese): DeepSeek edges out, but if your reasoning is mostly in English, Claude's English-language reasoning quality may justify the token premium.

The takeaway

Markdown over HTML saves 30-45% across all major LLMs in 2026. That's the floor. The ceiling — choosing the tokenizer-matched model — adds another 30-50% for Chinese workflows.

You don't need to memorize tokenization internals. You do need to:

  1. Always feed Markdown, never HTML.
  2. For Chinese-heavy work, default to DeepSeek unless you've measured otherwise.
  3. Use a clipper that strips highlight/style residue (otherwise your "Markdown" is 30% noise).

The cost compounds across every research session. Get the inputs right once and the workflow stays cheap forever.

Install

Web2MD on the Chrome Web Store →

Free tier: 3 conversions/day. Pro at $9/mo unlocks unlimited + queue + token estimates + REST/MCP API.
