HTML vs Markdown for Claude: Token Test Results from 12 Real Webpages (2026)
There are plenty of blog posts saying "Markdown is more efficient than HTML for LLMs." Most don't show real numbers. This one does.
I ran 12 real-world webpages, the kind of content people actually feed into Claude for research, through Claude in two forms: raw HTML and a clean Markdown conversion. Same Claude Opus 4.7 model, same questions, controlled comparison. Here are the numbers.
Methodology
For each of 12 pages:
- Fetch the rendered HTML in Chrome (real browser, real DOM).
- Test A: Copy the raw HTML (document.documentElement.outerHTML) and paste it into Claude.
- Test B: Convert with Web2MD's site-specific extractor and paste the Markdown into Claude.
- Ask the same 3 follow-up questions per page.
- Measure: input tokens, output tokens, answer accuracy (was the cited fact actually in the source), response time.
The 12 pages span representative content types:
| Page type | Example |
|---|---|
| Reddit thread (50+ comments) | r/MachineLearning on RoPE scaling |
| Wikipedia article | "Transformer (machine learning model)" |
| Substack post | Lenny's Newsletter "PMF metrics" |
| Stack Overflow question | "Why is Python's GIL still here?" |
| arXiv paper abstract | LoRA paper |
| GitHub README | langchain repo |
| MDN docs | Web Components spec |
| News article | Bloomberg AI coverage |
| Long-form blog | Stratechery Ben Thompson piece |
| Xiaohongshu post | Lifestyle review |
| WeChat public article | Tech analysis (mp.weixin.qq.com) |
| Documentation page | Anthropic's tool-use docs |
Token count results
Median input token count for each page type:
| Page type | HTML tokens | Markdown tokens | Markdown saves |
|---|---|---|---|
| Reddit thread | 18,400 | 11,200 | 39% |
| Wikipedia article | 24,800 | 16,300 | 34% |
| Substack post | 9,200 | 5,700 | 38% |
| Stack Overflow | 6,400 | 3,800 | 41% |
| arXiv abstract | 3,200 | 2,100 | 34% |
| GitHub README | 7,800 | 5,200 | 33% |
| MDN docs | 12,100 | 7,300 | 40% |
| News article | 8,400 | 4,700 | 44% |
| Long-form blog | 14,200 | 8,600 | 39% |
| Xiaohongshu post | 5,800 | 2,400 | 59% |
| WeChat article | 11,200 | 6,400 | 43% |
| Documentation | 9,800 | 6,000 | 39% |
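Median: Markdown saves 39% of input tokens across the 12 page types. Put the other way, the median HTML page costs about 65% more tokens than the same content in Markdown.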
The Xiaohongshu result is the outlier at 59%, and WeChat at 43% also sits near the top of the range, because Chinese-platform HTML carries heavy embedded JavaScript and tracking markup that Web2MD's Chinese-platform extractors strip cleanly.
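The savings column and its median can be recomputed from the two token columns, which also makes the difference between "Markdown saves X%" and "HTML costs Y% more" explicit. A minimal sketch that uses only the figures from the table; the function and variable names are mine, not part of the original harness.

```python
from statistics import median

# (HTML tokens, Markdown tokens) per page type, copied from the table above.
TOKENS = {
    "Reddit thread": (18_400, 11_200),
    "Wikipedia article": (24_800, 16_300),
    "Substack post": (9_200, 5_700),
    "Stack Overflow": (6_400, 3_800),
    "arXiv abstract": (3_200, 2_100),
    "GitHub README": (7_800, 5_200),
    "MDN docs": (12_100, 7_300),
    "News article": (8_400, 4_700),
    "Long-form blog": (14_200, 8_600),
    "Xiaohongshu post": (5_800, 2_400),
    "WeChat article": (11_200, 6_400),
    "Documentation": (9_800, 6_000),
}


def markdown_saving(html_tokens, md_tokens):
    """Fraction of the HTML token count removed by converting to Markdown."""
    return 1 - md_tokens / html_tokens


def html_overhead(html_tokens, md_tokens):
    """Extra tokens HTML costs, as a fraction of the Markdown token count."""
    return html_tokens / md_tokens - 1


savings = {page: markdown_saving(h, m) for page, (h, m) in TOKENS.items()}
overheads = {page: html_overhead(h, m) for page, (h, m) in TOKENS.items()}

median_saving = median(savings.values())
median_overhead = median(overheads.values())

for page, saving in savings.items():
    print(f"{page:<18} saves {saving:.0%}")
print(f"median saving:   {median_saving:.0%}")
print(f"median overhead: {median_overhead:.0%}")
```

On the table's numbers this prints a median saving of 39% and a median overhead of 65%. The two figures describe the same gap measured against different baselines.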
Cost in dollars
At Claude Opus 4.7 input pricing ($15/M tokens), reading these 12 pages once:
- HTML version: $1.97 in input cost
- Markdown version: $1.21 in input cost
- Savings: $0.76 per multi-page session ≈ 39%
Scale to "I do 20 research sessions like this a month" → $15/mo savings just from format choice. At $9/mo Pro pricing for Web2MD, the tool pays for itself purely on token savings before considering quality differences.
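The dollar figures follow from the token totals and the quoted input price. A small sketch of that arithmetic, assuming the $15/M rate and the $9/mo Pro price stated above and the column totals of the token table; the names are illustrative.

```python
import math

PRICE_PER_MILLION = 15.00    # USD per 1M input tokens, the rate quoted above
PRO_PRICE_PER_MONTH = 9.00   # USD, the Pro price quoted above
SESSIONS_PER_MONTH = 20      # the usage level assumed above

# Column totals of the token table (sum over the 12 rows).
HTML_TOKENS_TOTAL = 131_300
MARKDOWN_TOKENS_TOTAL = 79_700


def input_cost(tokens, price_per_million=PRICE_PER_MILLION):
    """Input cost in USD for a given number of tokens."""
    return tokens / 1_000_000 * price_per_million


html_cost = input_cost(HTML_TOKENS_TOTAL)
markdown_cost = input_cost(MARKDOWN_TOKENS_TOTAL)
saving_per_session = html_cost - markdown_cost
monthly_saving = saving_per_session * SESSIONS_PER_MONTH

# Number of sessions per month at which the saving covers the subscription.
break_even_sessions = math.ceil(PRO_PRICE_PER_MONTH / saving_per_session)

print(f"HTML:     ${html_cost:.2f}")
print(f"Markdown: ${markdown_cost:.2f}")
print(f"Saving:   ${saving_per_session:.2f} per session, ${monthly_saving:.2f} per month")
print(f"Break-even at {break_even_sessions} sessions per month")
```

Because the table values are rounded, the totals computed this way can differ from the quoted figures by a cent or so. At this rate the subscription is covered at roughly 12 sessions a month.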
Answer quality results
Three follow-up questions per page × 12 pages = 36 question-answer pairs per format. I scored each answer for:
- Factual accuracy — was the cited fact actually in the source?
- Specificity — did the answer reference specific passages, numbers, names from the source?
- Hallucination rate — did the answer invent facts not in the source?
Aggregate results:
| Metric | HTML input | Markdown input |
|---|---|---|
| Factual accuracy | 71% | 89% |
| Specificity score (1-5) | 3.2 | 4.4 |
| Hallucination rate | 14% | 6% |
The accuracy gap (71% vs 89%) was the biggest surprise. I expected token savings; I didn't expect the answer-quality difference to be this stark.
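For anyone scoring their own runs, the three metrics can be aggregated from hand-scored records along these lines. This is a sketch of one possible bookkeeping scheme, not the scoring sheet used for the numbers above; the ScoredAnswer fields and the demo records are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScoredAnswer:
    fmt: str                  # "HTML" or "Markdown"
    factually_accurate: bool  # was the cited fact actually in the source?
    specificity: int          # 1-5, how concretely the answer cites the source
    hallucinated: bool        # did the answer invent facts not in the source?


def aggregate(answers, fmt):
    """Summarise all scored answers for one input format."""
    rows = [a for a in answers if a.fmt == fmt]
    if not rows:
        raise ValueError(f"no scored answers for format {fmt!r}")
    n = len(rows)
    return {
        "n": n,
        "factual_accuracy": sum(a.factually_accurate for a in rows) / n,
        "specificity": mean(a.specificity for a in rows),
        "hallucination_rate": sum(a.hallucinated for a in rows) / n,
    }


# Illustrative records only, not the article's data.
demo = [
    ScoredAnswer("Markdown", True, 5, False),
    ScoredAnswer("Markdown", True, 4, False),
    ScoredAnswer("Markdown", True, 4, False),
    ScoredAnswer("Markdown", False, 3, True),
    ScoredAnswer("HTML", True, 3, False),
    ScoredAnswer("HTML", False, 2, True),
]

for fmt in ("HTML", "Markdown"):
    print(fmt, aggregate(demo, fmt))
```

Keeping one record per question-answer pair makes it easy to re-check individual scores later instead of trusting only the aggregate.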
Two hypotheses for why HTML hurts quality:
- Attention dilution: HTML pages carry navigation, footer text, related-article widgets, comment count badges, social-share buttons, embedded scripts. The model's attention is finite — when 30-50% of input is non-content, the model gets less signal per token of "thinking budget."
- Tokenizer fragmentation: tags like <span class="hljs-keyword"> get split into 6-8 tokens that interrupt sentence flow. The model processes sentences differently when they're spliced with markup tokens.
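The attention-dilution hypothesis is easy to sanity-check on your own pages by measuring how much of the raw HTML is not visible text. A rough sketch using only the standard library; it counts characters rather than tokens, and it still treats navigation and footer text as content, so read the result as a rough lower bound on the noise. The class and function names are mine.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside non-rendered elements."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_visible_text(html):
    """Return the page's text nodes joined by single spaces."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def non_content_share(html):
    """Fraction of the HTML's characters that are markup, scripts, or styles."""
    if not html:
        return 0.0
    return 1 - len(extract_visible_text(html)) / len(html)


sample = (
    "<html><head><script>var a = 1;</script><style>p{color:red}</style></head>"
    '<body><p>Hello <span class="hljs-keyword">world</span></p></body></html>'
)
print(extract_visible_text(sample))
print(f"{non_content_share(sample):.0%} of characters are not visible text")
```

A character ratio is only a proxy, since markup often tokenizes less efficiently than prose, but it is enough to see whether a page falls in the 30-50% range mentioned above.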
When HTML actually wins
One edge case where HTML produced better results: a structured financial data table with rowspan/colspan that Markdown's GFM table syntax couldn't represent cleanly. The Markdown version flattened a multi-level header into single-level, losing the column-group context. The HTML version preserved it. Claude's answer on that page was more accurate with HTML.
This is rare. For most webpage content — articles, documentation, threads, social posts — Markdown is the right choice. For complex tabular data, consider keeping the HTML or asking your converter to preserve the table structure explicitly.
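One way to act on this is to check whether a table uses rowspan or colspan before converting it, and keep the HTML when it does. A heuristic sketch with the standard-library parser; table_needs_html is a hypothetical helper, not a Web2MD feature.

```python
from html.parser import HTMLParser


def _span_value(raw):
    """Parse a rowspan/colspan attribute, treating junk or missing values as 1."""
    try:
        return int((raw or "").strip())
    except ValueError:
        return 1


class SpanDetector(HTMLParser):
    """Flags any table cell that spans more than one row or column."""

    def __init__(self):
        super().__init__()
        self.has_spans = False

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        for name, value in attrs:
            if name in ("rowspan", "colspan") and _span_value(value) > 1:
                self.has_spans = True


def table_needs_html(html):
    """True if the markup contains merged cells that a GFM table cannot express."""
    detector = SpanDetector()
    detector.feed(html)
    detector.close()
    return detector.has_spans


grouped = (
    '<table><tr><th rowspan="2">Region</th><th colspan="2">Q1</th></tr>'
    "<tr><th>Revenue</th><th>Margin</th></tr></table>"
)
flat = "<table><tr><th>Region</th><th>Revenue</th></tr></table>"

print(table_needs_html(grouped))
print(table_needs_html(flat))
```

A pipeline could use this to pass flagged tables through as HTML while converting the rest of the page to Markdown.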
What about other formats
I also tested:
- Plain text (stripped of all formatting): ~10% smaller than Markdown but accuracy dropped to 76% — losing structure hurts comprehension.
- JSON (page content serialized as {"title": "...", "body": "..."}): roughly the same tokens as Markdown, accuracy similar. Useful if you're building structured pipelines, but no clear win over Markdown for raw context.
The takeaway: structure helps comprehension, syntax doesn't have to cost tokens. Markdown is the sweet spot.
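If you do go the JSON route, one detail matters for the Chinese-platform pages: Python's json.dumps escapes non-ASCII characters by default, turning each character into a six-character \uXXXX sequence. A short sketch of the serialization with that escaping turned off; page_to_json is a hypothetical helper, and the effect on token counts is my assumption, not something measured in this test.

```python
import json


def page_to_json(title, body, ascii_only=False):
    """Serialize page content as {"title": ..., "body": ...}.

    With ascii_only=False, non-ASCII text is written as-is instead of
    being escaped to \\uXXXX sequences.
    """
    return json.dumps({"title": title, "body": body}, ensure_ascii=ascii_only)


readable = page_to_json("标题", "正文内容")
escaped = page_to_json("标题", "正文内容", ascii_only=True)

print(readable)
print(escaped)
print(len(readable), len(escaped))
```

Both strings decode to the same object; only the readable form keeps the payload close to the length of the original text.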
Reproducing the test
The harness is straightforward if you want to validate on your own content:
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def test_page(html_source, markdown_source, question):
    # Ask the same question against each format and compare token usage.
    for label, source in [("HTML", html_source), ("Markdown", markdown_source)]:
        r = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1000,
            messages=[
                {"role": "user", "content": f"{source}\n\n---\n\n{question}"}
            ],
        )
        print(f"{label}: input_tokens={r.usage.input_tokens}, response={r.content[0].text[:200]}")
```
The HTML side comes from your browser's document.documentElement.outerHTML. The Markdown side comes from a clipper or a converter — Web2MD, Jina Reader, or a custom Pandoc/Turndown pipeline. Save both per page, run the loop, score by hand for a sample of 5-10 questions per format.
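If only the token columns matter, the Anthropic Python SDK's token-counting call (client.messages.count_tokens) reports input_tokens without generating a reply. A sketch assuming the same prompt layout as the harness above; compare_input_tokens is a hypothetical helper, and the client is passed in as an argument so the function can be exercised with a stub.

```python
def compare_input_tokens(client, model, html_source, markdown_source, question):
    """Count input tokens for the same question against both formats.

    `client` is expected to expose messages.count_tokens(model=..., messages=...)
    returning an object with an `input_tokens` attribute, as the Anthropic
    Python SDK client does.
    """
    counts = {}
    for label, source in (("HTML", html_source), ("Markdown", markdown_source)):
        result = client.messages.count_tokens(
            model=model,
            messages=[
                {"role": "user", "content": f"{source}\n\n---\n\n{question}"}
            ],
        )
        counts[label] = result.input_tokens
    # Fraction of the HTML token count removed by using Markdown instead.
    counts["markdown_saving"] = 1 - counts["Markdown"] / counts["HTML"]
    return counts
```

With a real client this would be called as compare_input_tokens(anthropic.Anthropic(), "claude-opus-4-7", html, markdown, question). The question is part of both prompts, so the saving it reports is slightly lower than for the page content alone.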
Practical takeaways
- Default to Markdown over HTML when feeding webpages to Claude. A 39% median token saving plus an 18-point accuracy gain (71% to 89%, about 25% relative) is not marginal.
- The clipper choice matters. A clipper that leaves syntax-highlight residue and navigation noise eats half the benefit. Use one with strong site-specific extractors.
- For Chinese content the gap widens. WeChat / Xiaohongshu HTML is ~50% noise, and Markdown conversion saved 43% and 59% respectively.
- For multi-page research sessions, savings compound. A 20-page corpus that costs $4 in HTML costs $2.40 in Markdown — and produces sharper synthesis.