Hacker News Thread to Markdown for Claude Research (2026)

Zephyr Whimsy · 2026-06-04 · 6 min read

Hacker News is the highest-signal technical discussion forum on the open web. A 400-comment thread on system design or a runtime quirk often contains more useful wisdom than any single blog post or documentation page. The problem: getting that thread into Claude or ChatGPT in a form they can actually reason over.

This post is the workflow.

Why HN threads beat almost everything else for research synthesis

I have done the same research question across Reddit, X, LinkedIn, and HN multiple times. HN consistently wins for technical synthesis because:

  • Higher signal density: comments average 3-5 substantive sentences, not 1-line reactions
  • Cited claims: experienced commenters link papers, RFCs, source code
  • Self-correction: incorrect claims get pushed back on within hours, not days
  • Karma signal: vote counts roughly track usefulness for technical content
  • Less promotional content than LinkedIn, less casual chatter than Reddit

For "what does the senior engineer community think about X?" — HN is the canonical first source.

What standard fetchers see

HN's thread page (news.ycombinator.com/item?id=...) is server-rendered: the first ~30 comments arrive inline and the rest sit behind a [more] link. ChatGPT browse and Claude WebFetch get:

  • Submission title, URL, points
  • Top-level comments (~10-30)
  • A [more] link for the rest

For threads under 50 comments this is fine. For 200+ comment threads — where the actual substance is in deeper branches — it's almost useless. You get the visible 15% of the discussion.

What clean HN Markdown looks like

After running through an HN-aware extractor:

# What if your build system was just a few hundred lines of code?

**Source**: https://news.ycombinator.com/item?id=12345678
**Submitted by**: user42 · **Points**: 489 · **Comments**: 312
**Posted**: 2026-05-15

## OP comment

I built a small build system in Go that's about 600 lines total. Here's
what it does differently from Bazel/Buck...

## Top thread

- **bcantrill** (84 points): "This is the right direction. The complexity of
  Bazel is a tax most projects pay for features they never use..."
  - **user123** (42 points): "Counter-point: Bazel's remote caching is the
    whole point. A small local build tool can't replicate that..."
    - **bcantrill** (28 points): "Fair, but you can layer caching on top of a
      simpler core. The Buck folks tried this with [link to paper]..."
  - **another_user** (35 points): "Also worth noting: the simpler approach
    breaks down at ~500 targets. Below that it's clearly better."

- **drnewman** (61 points): "Your benchmarks compare against Bazel cold-start
  but Bazel's actual production cost is incremental rebuilds..."

[continues for full thread]

## [dead] and [flagged] markers preserved

About 25-30k tokens for a 300-comment thread. Author handles, point counts, parent-child relationships, and dead/flagged states are all preserved. Claude reads this and produces a synthesis grounded in specific highly upvoted comments.
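The token figure is easy to sanity-check before pasting. Here is a rough sketch, assuming the common rule of thumb of about four characters per token for English text. That ratio is an approximation, not Claude's actual tokenizer, and `estimate_tokens` and `tokens_per_comment` are hypothetical helper names.

```python
def estimate_tokens(markdown: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The 4-chars-per-token ratio is an assumed heuristic."""
    return round(len(markdown) / chars_per_token)


def tokens_per_comment(total_tokens: int, comment_count: int) -> float:
    """Average tokens per comment, or 0.0 for an empty thread."""
    return total_tokens / comment_count if comment_count else 0.0
```

By this arithmetic, 25-30k tokens across 300 comments works out to roughly 83-100 tokens per comment, which is in line with comments of a few sentences each.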

The workflow

Three paths:

Path 1: Web2MD HN extractor (interactive)

Open the HN thread in Chrome. Click Web2MD. The HN-specific extractor:

  • Hits HN's Firebase API behind the scenes to get the full comment tree
  • Preserves nesting up to 5 levels with proper indentation
  • Captures author handle, point count, posted timestamp
  • Marks [dead], [flagged], and [downvoted] comments
  • Formats as clean Markdown ready to paste into Claude or save

End-to-end: ~6 seconds per thread including HN API roundtrip.
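To make the nesting and marker behavior concrete, here is a minimal sketch of depth-capped rendering. This is not Web2MD's code. It assumes the tree has already been fetched into dicts with a hypothetical `children` key, and it interprets "up to 5 levels" as flattening deeper replies onto the fifth level. It handles only `dead` and `deleted`, the two states that appear as item fields in the public API.

```python
def render_tree(node: dict, depth: int = 0, max_depth: int = 5) -> str:
    """Render one prefetched comment and its replies as a nested Markdown list."""
    # Replies deeper than max_depth keep the indentation of the deepest allowed level.
    indent = "  " * min(depth, max_depth - 1)
    if node.get("deleted"):
        line = f"{indent}- [deleted]\n"
    else:
        marker = "[dead] " if node.get("dead") else ""
        author = node.get("by", "unknown")
        text = " ".join(node.get("text", "").split())  # collapse whitespace to one line
        line = f"{indent}- {marker}**{author}**: {text}\n"
    replies = "".join(
        render_tree(child, depth + 1, max_depth) for child in node.get("children", [])
    )
    return line + replies
```

Unlike the marker-only approach, this version keeps the text of dead comments and still descends into their replies, which is a design choice you may or may not want.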

Path 2: HN Firebase API + 30-line script

For developers who want batch extraction:

import requests

API = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def hn_to_markdown(item_id):
    def fetch(item):
        # Returns None when the item does not exist (the API responds with null).
        return requests.get(API.format(item), timeout=10).json()

    def render_comment(c, depth=0):
        indent = "  " * depth
        if not c:
            return ""  # missing item: nothing to render
        if c.get("dead") or c.get("deleted"):
            marker = "[dead]" if c.get("dead") else "[deleted]"
            return f"{indent}- {marker}\n"
        author = c.get("by", "unknown")
        # Note: "text" is HTML (<p>, <a>, entities) and is passed through as-is here.
        text = c.get("text", "").replace("\n", f"\n{indent}  ")
        md = f"{indent}- **{author}**: {text}\n"
        for kid_id in c.get("kids", []):
            md += render_comment(fetch(kid_id), depth + 1)
        return md

    root = fetch(item_id)
    md = f"# {root['title']}\n\n**URL**: {root.get('url', 'self post')}\n"
    md += f"**Points**: {root.get('score', 0)} · **By**: {root.get('by')}\n\n"
    for kid_id in root.get("kids", []):
        md += render_comment(fetch(kid_id))
    return md

About 30 lines, and it handles the full tree. It makes one sequential request per comment, so large threads take a while. You may hit rate limits at ~10k requests, but typical use is well under that.
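One thing the script leaves alone: the API returns each comment's `text` as HTML, with `<p>` between paragraphs, `<a href>` links, `<i>` for emphasis, and escaped entities. A minimal cleanup sketch using only the standard library follows. `hn_html_to_markdown` is a hypothetical helper, and the regexes assume the simple markup HN emits rather than arbitrary HTML.

```python
import html
import re


def hn_html_to_markdown(text: str) -> str:
    """Convert the simple HTML in an HN comment body to Markdown-ish text."""
    text = text.replace("<p>", "\n\n")                  # paragraph separators
    text = re.sub(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>',
                  r"[\2](\1)", text, flags=re.S)        # links -> [label](url)
    text = re.sub(r"</?i>", "*", text)                  # italics -> *emphasis*
    text = re.sub(r"<[^>]+>", "", text)                 # drop any remaining tags
    # Unescape last, so escaped text like &lt;b&gt; is not mistaken for a tag.
    return html.unescape(text).strip()
```

Apply it to each comment's `text` field before indenting it into the list.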

Path 3: Bulk HN research corpus

# Identify HN threads via Algolia search (reuses requests and hn_to_markdown from Path 2)
resp = requests.get(
    "https://hn.algolia.com/api/v1/search",
    params={"query": "your topic", "tags": "story", "hitsPerPage": 30},  # default page size is 20
    timeout=10,
)
thread_ids = [hit["objectID"] for hit in resp.json()["hits"]]
corpus = "\n\n---\n\n".join(hn_to_markdown(tid) for tid in thread_ids)
# Now paste corpus into Claude

30 threads on one topic, automatically. Combined corpus typically ~500k-1M tokens for substantial discussions.
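A corpus of that size may be larger than the context window you have available, so it can help to split it into batches. Below is a sketch that packs thread documents greedily under a token budget. `batch_by_budget` is a hypothetical helper, and the token cost uses the same rough four-characters-per-token assumption rather than a real tokenizer.

```python
def batch_by_budget(docs: list[str], budget_tokens: int,
                    chars_per_token: float = 4.0) -> list[str]:
    """Greedily pack documents into batches that stay under budget_tokens."""
    sep = "\n\n---\n\n"
    batches, current, used = [], [], 0
    for doc in docs:
        cost = int(len(doc) / chars_per_token) + 1  # rough, slightly pessimistic
        if current and used + cost > budget_tokens:
            batches.append(sep.join(current))       # flush the full batch
            current, used = [], 0
        current.append(doc)
        used += cost
    if current:
        batches.append(sep.join(current))
    return batches
```

A single thread that exceeds the budget on its own still gets its own batch, so check for oversized batches before pasting.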

A real research session

I needed to understand "what's the consensus on monolith vs microservices for early-stage startups in 2026?"

  • Used HN Algolia search for relevant threads from past 18 months
  • Selected 18 substantive threads (each with 100+ comments)
  • Web2MD queue + bulk export: ~25 minutes including skim-reading
  • Combined corpus: ~340k tokens
  • Pasted into Claude Opus 4.7 with the prompt: "These are 18 HN threads on monolith vs microservices for startups. What are the 5 most-upvoted arguments for each side, and where does HN actually agree vs disagree? Cite specific comment authors and threads."
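The first two steps can be pushed into the search query itself. The sketch below builds parameters for the HN Algolia search endpoint using its `numericFilters` on `created_at_i` and `num_comments`. `algolia_params` is a hypothetical helper, and a month is approximated as 30 days.

```python
import time
from typing import Optional


def algolia_params(query: str, months: int = 18, min_comments: int = 100,
                   now: Optional[float] = None) -> dict:
    """Build search params for stories newer than `months` with enough comments."""
    now = time.time() if now is None else now
    cutoff = int(now - months * 30 * 24 * 3600)  # approximate: 30-day months
    return {
        "query": query,
        "tags": "story",
        "numericFilters": f"created_at_i>{cutoff},num_comments>={min_comments}",
        "hitsPerPage": 50,
    }
```

Pass the result as `params=` to `requests.get("https://hn.algolia.com/api/v1/search", ...)`, then pick the threads worth keeping from the hits by hand.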

Output: an 8-page synthesis with specific citations (user42 in thread X argued...) and identified consensus zones vs disagreement zones. Total time: ~70 minutes. The manual version would have been a full week of reading.

What HN is not good for

To be honest about the limits:

  • Recent breaking news: HN front page shifts daily. For ongoing events, the snapshot becomes stale fast.
  • Non-technical topics: HN's comment quality varies widely outside its core competencies (tech, startups, programming language design). For consumer product discussion, Reddit is better.
  • Original research data: HN comments cite primary sources; they aren't primary sources themselves. Follow the cited links for load-bearing claims.
  • Bias awareness: HN skews male, US-coastal, infrastructure-engineering. The "consensus" reflects that demographic.
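To follow cited links systematically, you can pull every URL out of the exported Markdown and see which sources come up most often. This is a small sketch, not part of any tool mentioned here. `cited_links` is a hypothetical helper, and the URL regex is deliberately simple.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://[^\s)\]>\"']+")


def cited_links(markdown: str, top: int = 10) -> list[tuple[str, int]]:
    """Return the most frequently cited external URLs in a thread export."""
    urls = (u.rstrip(".,;:") for u in URL_RE.findall(markdown))  # trim sentence punctuation
    counts = Counter(u for u in urls if "news.ycombinator.com" not in u)
    return counts.most_common(top)
```

Links cited by several independent commenters are a reasonable place to start reading the primary sources.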

Quick wins

If you already use Web2MD, open any HN thread and click the extension. The HN-specific extractor produces what's shown above. Free tier handles 3 conversions/day.

For dev workflows, the HN Firebase API (above) + 30 lines of Python gets you the full pipeline. HN's API has no auth and very lenient rate limits — built for exactly this kind of access.
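Because the Firebase API serves one item per request, a sequential script spends most of its time waiting on round trips. Here is a sketch of fetching a thread level by level with a small thread pool. `fetch_tree` is a hypothetical helper, and `fetch` is injected so that you can pass in your own function wrapping the item endpoint (or a stub for testing). Keep the worker count modest to stay polite.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def fetch_tree(root_id: int, fetch: Callable[[int], Optional[dict]],
               workers: int = 16) -> dict:
    """Fetch an item and all of its descendants, returning {item_id: item}."""
    items: dict = {}
    frontier = [root_id]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            results = list(pool.map(fetch, frontier))  # one tree level in parallel
            next_frontier = []
            for item_id, item in zip(frontier, results):
                if item is None:  # missing items come back as null
                    continue
                items[item_id] = item
                next_frontier.extend(item.get("kids", []))
            frontier = next_frontier
    return items
```

With the whole tree in a dict, rendering becomes a pure in-memory step that follows each item's `kids` list.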

Install

Web2MD on the Chrome Web Store →

Free tier: 3 conversions/day. Pro at $9/mo unlocks unlimited + queue + bulk export + dedicated HN extractor that hits the Firebase API for full comment trees.