Scrape Reddit to Markdown for Claude
If your question is "How do I scrape Reddit threads to feed them as context to Claude without hitting anti-bot blocks?", my honest answer is: do not start by trying to scrape Reddit's HTML at scale.
Start with the least brittle workflow that gets you the context you need.
For one thread, a few search results, or a research session where you are already reading Reddit in Chrome, use Web2MD to copy the visible page as clean Markdown and paste it into Claude.
For bulk ingestion, automation, monitoring, or anything that looks like a data pipeline, use Reddit's official API through PRAW, snoowrap, or direct OAuth calls.
For hosted collection, use a service like Apify, but check how the actor gathers data and whether that fits Reddit's terms.
Those are different jobs. Treating them as one job is where people get into trouble.
The practical workflow I recommend
When I only need Reddit as context for Claude, I use this workflow:
- Open the Reddit thread in Chrome.
- Expand the comments I care about.
- Collapse low-value branches or leave them out.
- Use Web2MD to convert the page into Markdown.
- Paste the Markdown into Claude with a short instruction: "Use this Reddit thread as source context. Distinguish first-hand reports from speculation. Summarize recurring themes and cite comment handles where available."
- If the thread is huge, split the Markdown by section or ask Claude to process it in batches.
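The batching step can be scripted if you do it often. Here is a minimal sketch, assuming the captured Markdown uses `#`-style headings and that a character budget is a good enough stand-in for a token budget. The function name `split_markdown` and the `max_chars` default are my own choices for illustration, not part of Web2MD.

```python
import re


def split_markdown(markdown: str, max_chars: int = 8000) -> list[str]:
    """Group Markdown sections into batches, breaking only at headings."""
    # Zero-width split before every line that starts with 1-6 '#' and a space.
    # Caveat: a '# ' line inside a fenced code block is treated as a heading too.
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", markdown) if s.strip()]

    batches: list[str] = []
    current = ""
    for section in sections:
        # Start a new batch when adding this section would exceed the budget.
        if current and len(current) + len(section) > max_chars:
            batches.append(current.rstrip())
            current = ""
        current += section
    if current.strip():
        batches.append(current.rstrip())
    return batches
```

A single section longer than `max_chars` is kept whole rather than cut mid-comment, so very long comments may still need manual trimming before you paste them.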
This avoids the usual anti-bot mess because I am not running a scraper farm. I am using the browser like a human, then converting the page I can already view into a format Claude can actually use.
For more background on why Claude struggles with Reddit pages directly, see /blog/why-claude-cant-read-reddit. If you are comparing browser capture against Reddit's JSON/API route, the more technical breakdown is in /blog/reddit-json-api-vs-scraping-2026.
What Web2MD gives Claude
Raw Reddit HTML is awful context. You get scripts, navigation, tracking markup, duplicated labels, sidebar content, buttons, and hidden UI text. Claude does not need any of that.
What Claude needs is the thread title, URL, original post, useful comments, nesting, timestamps if available, and links.
A Web2MD capture should look more like this:
# How are people handling Claude context limits for long research threads?
Source: https://www.reddit.com/r/ClaudeAI/comments/example/
Captured: 2026-06-19
## Original post
I'm trying to feed several long Reddit discussions into Claude for research.
Copy/paste works, but the formatting gets messy and I lose the comment hierarchy.
What are people using?
## Top comments
### u/context_window_nerd
I usually convert the page to Markdown first, then remove low-signal replies.
Claude does much better when the thread structure is still visible.
> The important part is keeping quotes and parent comments attached.
### u/api_first
If you need hundreds of threads, use the Reddit API. Manual clipping is fine
for research, but don't build a crawler around your browser.
That is not magic. It is just the right shape for an LLM: readable text, headings, quotes, and enough metadata to keep the source understandable.
Here is a second example of how I would hand Claude a cleaned-up thread excerpt:
# Reddit thread context: laptop battery drain after macOS update
## Research question
Find recurring causes and fixes mentioned by users. Separate confirmed fixes
from guesses.
## Evidence from thread
- u/terminal_dad: Battery drain stopped after disabling "Wake for network access."
- u/m2_air_user: Activity Monitor showed `photoanalysisd` running for six hours after update.
- u/it_was_spotlight: Spotlight indexing finished overnight; battery normalized the next day.
- u/no_fix_yet: Clean install did not help. Still seeing 20% overnight drain.
## Notes for Claude
Do not treat upvotes as proof. Look for repeated patterns across comments.
Mention uncertainty where the comments conflict.
This is the part Web2MD is good at: turning a messy webpage into a compact, readable source packet for ChatGPT, Claude, Cursor, or any other AI tool.
Where the API tools are better
For automation, the API-first recommendation is the right one.
PRAW is the best Python choice if you want to pull submissions and comments into a script. It handles Reddit objects nicely, and you can normalize the output into Markdown or JSON.
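As a rough sketch of that normalization step, the code below renders a thread as nested Markdown bullets. The renderer only relies on attributes PRAW exposes on submissions and comments (`title`, `permalink`, `selftext`, `comments`, `author`, `body`, `replies`). The function names, the bullet layout, the `max_depth` default, and the placeholder credentials are assumptions for illustration.

```python
def render_comment(comment, depth: int = 0, max_depth: int = 3) -> list[str]:
    """Render one comment and its replies as nested Markdown bullets."""
    if depth > max_depth:
        return []
    # PRAW sets author to None when the account was deleted.
    author = f"u/{comment.author}" if comment.author else "[deleted]"
    # Collapse internal newlines so each comment stays on a single bullet.
    body = " ".join(comment.body.split())
    lines = [f"{'  ' * depth}- **{author}**: {body}"]
    for reply in comment.replies:
        lines.extend(render_comment(reply, depth + 1, max_depth))
    return lines


def render_thread(submission, max_depth: int = 3) -> str:
    """Render a submission and its comment tree as one Markdown document."""
    lines = [
        f"# {submission.title}",
        "",
        f"Source: https://www.reddit.com{submission.permalink}",
        "",
        "## Original post",
        "",
        submission.selftext.strip(),
        "",
        "## Comments",
        "",
    ]
    for comment in submission.comments:
        lines.extend(render_comment(comment, 0, max_depth))
    return "\n".join(lines) + "\n"


def fetch_submission(url: str):
    """Load a thread through Reddit's API. Needs `pip install praw` and app credentials."""
    import praw  # imported lazily so the renderer works without PRAW installed

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",          # placeholder
        client_secret="YOUR_CLIENT_SECRET",  # placeholder
        user_agent="python:com.example.thread-notes:v0.1 (by /u/your_username)",
    )
    submission = reddit.submission(url=url)
    # Drop "load more comments" placeholders instead of fetching them all.
    submission.comments.replace_more(limit=0)
    return submission
```

With real credentials filled in, `print(render_thread(fetch_submission(thread_url)))` should give you Markdown you can paste or chunk. Passing `limit=0` to `replace_more` trades completeness for fewer API calls, which is usually the right trade for context gathering.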
snoowrap is the comparable Node.js option. If your ingestion pipeline is already TypeScript, it is a sensible pick.
Direct Reddit OAuth gives you the most control. It is more work, but you decide exactly how to handle pagination, retries, comment depth, and caching.
Those options win when you need:
- Hundreds or thousands of threads
- Scheduled collection
- Repeatable datasets
- Comment IDs and parent IDs
- Full control over rate limiting
- A backend pipeline for RAG or analytics
If you are building "Reddit API -> normalize comments -> chunk -> Claude context", use the API. I would not use Web2MD as a fake crawler for that. It is the wrong tool.
Where Web2MD wins
Web2MD wins in a narrower but very common scenario: you are doing live research and need the page in Claude now.
It is especially useful when:
- You only need one to ten threads, not a warehouse of Reddit data.
- You want the exact page you are viewing, including expanded comments.
- You want to manually choose which branches matter before sending context.
- You do not want to create a Reddit app, manage OAuth secrets, or write a script.
- You are comparing Reddit with other pages like Hacker News, docs, GitHub issues, Substack posts, or forum threads.
- You are feeding context into Claude or Cursor, not building a production scraper.
That last point matters. AI research is often messy. You read a Reddit thread, a GitHub issue, two docs pages, and a blog post. Then you want Claude to reason across all of it. Web2MD keeps that workflow browser-native.
If that is your use case, also read /blog/reddit-thread-to-claude-research and /blog/markdown-vs-html-for-llm. The format matters more than people expect.
What about Apify and Pushshift?
Apify can be useful if you want hosted workflows and do not want to maintain infrastructure. The tradeoff is that you need to understand the actor you are using. Some actors rely on scraping behavior that may be brittle or inappropriate for your use case. Prefer API-backed actors where possible.
Pushshift is a different case. It has historically been useful for Reddit research, especially older data, but access and completeness have changed over time. I would not design a new workflow that assumes Pushshift can replace Reddit's API for everything.
For current threads, I would choose between API access and browser-based Markdown capture first.
What I would avoid
I would avoid anything framed as "beating" Reddit's anti-bot systems.
That includes proxy rotation, CAPTCHA solving, residential IP pools, fake browser fingerprints, and aggressive concurrency. Besides the terms-of-service risk, those workflows are fragile. They break at the worst time, and they produce messy data unless you spend even more time cleaning it.
If you need scale, use OAuth, a descriptive user agent, backoff, caching, and comment depth limits. If you need context from a page you are already viewing, use Web2MD.
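The backoff part of that advice fits in a few lines. This is a minimal sketch, not a full client: `with_backoff` is a hypothetical helper, the delays are arbitrary defaults, and the user agent string only illustrates the `<platform>:<app ID>:<version> (by /u/<username>)` shape that Reddit's API rules suggest.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical values; replace the app ID and username with your own.
USER_AGENT = "python:com.example.thread-notes:v0.1 (by /u/your_username)"


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying with exponential backoff plus jitter on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

In a real pipeline I would narrow `retry_on` to rate-limit and transient network errors, so that bad credentials or a deleted thread fail immediately instead of being retried.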
For a broader look at anti-bot platforms and AI research workflows, see /blog/anti-bot-platforms-ai-research-workflow-2026.
Web2MD limitations
Web2MD is not a universal Reddit ingestion system.
The free tier allows 3 conversions per day. Pro is $9/month if you need more. It is Chrome-only, so it is not the right fit if your whole workflow lives in Firefox, Safari, or a server-side job. It also only captures what your browser can access and what the page exposes in the rendered view.
That is the honest boundary: Web2MD is a fast human-in-the-loop capture tool, not an anti-bot bypass or bulk data API.
My final recommendation
Use this decision rule:
If you need a dataset, use Reddit's API.
If you need a readable source packet for Claude from a thread you are already viewing, use Web2MD.
If you need hosted extraction, evaluate Apify carefully.
For most AI research sessions, the browser-to-Markdown path is the fastest. Open the thread, expand the useful comments, convert it to Markdown, and paste it into Claude with a clear instruction.
Install Web2MD here: https://web2md.org