reddit apireddit jsonprawpushshiftreddit scrapingdeveloper toolsweb2mdjson endpoint

Reddit JSON API vs Scraping: The Honest 2026 Comparison for Developers

Zephyr Whimsy2026-05-275 min read

Reddit JSON API vs Scraping: The Honest 2026 Comparison for Developers

There are five paths to read Reddit content programmatically in 2026. Each has real tradeoffs. This post tested all of them with the same goal: pull 200 threads from a niche subreddit, get clean structured data, output as Markdown.

Path 1: The .json endpoint (still works, still underrated)

Every public Reddit URL accepts .json as a suffix:

https://www.reddit.com/r/ObsidianMD/comments/abc123/thread/.json

Returns a JSON array with the post and comment tree. Some practical notes:

  • No auth required for public content. Rate-limited around 60 req/min from an unauthenticated client.
  • Same data as Reddit's own clients. Score, body, author, timestamp, edited flag, nested replies.
  • No deprecation in sight. Reddit has restricted Pushshift, locked down user data, and raised API prices — but the per-URL .json endpoint is part of how Reddit itself works and has stayed open.

The catch: you parse JSON yourself. There is no built-in Markdown serializer.

Quick example

import requests
url = "https://www.reddit.com/r/ObsidianMD/comments/abc123/thread/.json"
r = requests.get(url, headers={"User-Agent": "research-script/1.0 by /u/yourname"})
data = r.json()
post = data[0]["data"]["children"][0]["data"]
comments = data[1]["data"]["children"]

Set the User-Agent. Reddit blocks generic python-requests/X and curl/X UAs.

Path 2: PRAW (the Python wrapper)

PRAW (Python Reddit API Wrapper) is the de-facto library. Handles OAuth, rate limiting, pagination.

import praw
reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="research/1.0")
for submission in reddit.subreddit("ObsidianMD").top(time_filter="month", limit=50):
    print(submission.title, submission.score)

Pros: Polished, well-documented, handles edge cases (deleted users, removed comments, archived threads). Cons: Needs OAuth registration. The free tier is 100 req/min — generous for personal use, insufficient for anything commercial.

Path 3: Server-side HTML scraping (the one that doesn't work)

The path most generic tools take, and the path that mostly fails in 2026:

import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.reddit.com/r/ObsidianMD/comments/abc123/thread/")
soup = BeautifulSoup(r.content, "html.parser")

What you get back:

  • The page shell.
  • Login banner.
  • Navigation.
  • Maybe — if Reddit decides — the first post and 1-2 comments.
  • Almost certainly not the rest of the comment tree, which is rendered after JavaScript hydrates.

If you switch to Playwright to handle JS rendering, you hit Cloudflare. If you handle Cloudflare, you hit Reddit's own detection. Around 50 requests in, you're rate-limited or shadow-banned for a few hours.

Verdict: don't do this in 2026. The .json endpoint is right there, faster, and not actively fought against.

Path 4: Pushshift / Arctic Shift (historical data only)

Pushshift was the de-facto archive of all Reddit content. Reddit restricted Pushshift access in 2023; the public API is gone for general use.

The community-maintained successor is Arctic Shift, which mirrors a subset of historical Reddit data. Coverage is partial — newer threads may not be indexed; recent edits may not be reflected.

Use case: historical analysis of subreddit trends over years. Not for current threads.

Path 5: Browser-side extraction (the no-code path)

For non-developers and for "I just want this thread in my AI prompt," the simplest path is a Chrome extension that does Path 1 internally:

Web2MD's Reddit extractor:

  • Hits the .json endpoint for the current thread URL.
  • Parses the JSON tree.
  • Outputs clean Markdown — title, author, score, body, full nested comment tree.
  • Preserves Reddit's own markdown formatting.

Cost: one click per thread, no API key, no code.

This is technically Path 1 with a UI wrapper. It is what most AI-research workflows actually need: the .json data, formatted for paste-into-Claude, with the ergonomics of "click button, paste."

When to pick which path

| Use case | Best path | |---|---| | One-off thread → AI prompt | Path 5 (Web2MD or any clipper) | | 50 threads → AI corpus for a research session | Path 5 with queue/bulk export, OR Path 1 with a Markdown serializer | | Personal research script, < 100 req/min | Path 1 (.json) or Path 2 (PRAW) | | Heavy programmatic use, commercial | Path 2 (PRAW) with paid API access | | Historical subreddit analysis | Path 4 (Arctic Shift) | | Server-side scrape | Stop. Go to Path 1. |

Markdown serializer for Path 1 (free starter)

If you go Path 1 and want Markdown output, here's a minimal serializer:

def comment_to_md(comment, depth=0):
    indent = "  " * depth
    score = comment.get("score", 0)
    author = comment.get("author", "[deleted]")
    body = comment.get("body", "").replace("\n", f"\n{indent}")
    md = f"{indent}- **{author}** (score: {score})\n{indent}  {body}\n"
    replies = comment.get("replies")
    if isinstance(replies, dict):
        for child in replies["data"]["children"]:
            md += comment_to_md(child["data"], depth + 1)
    return md

def thread_to_md(thread_json):
    post = thread_json[0]["data"]["children"][0]["data"]
    md = f"# {post['title']}\n\n"
    md += f"**Source:** https://reddit.com{post['permalink']}\n"
    md += f"**Author:** /u/{post['author']} · **Score:** {post['score']}\n\n"
    md += f"{post.get('selftext', '')}\n\n## Comments\n\n"
    for c in thread_json[1]["data"]["children"]:
        md += comment_to_md(c["data"])
    return md

That's ~25 lines. Run it against the .json endpoint output and you have a personal Reddit-to-Markdown pipeline.

What I actually use

Honest answer: Web2MD for interactive single-thread work, custom Python script using .json for bulk corpus building, PRAW for anything that needs OAuth-gated user-specific data. Three paths, three different jobs.

Install

Web2MD on the Chrome Web Store →

Free tier: 3 conversions per day. Pro at $9/mo for unlimited + queue + bulk export + REST/MCP API.

Related Articles