xiaohongshu · rednote · 小红书 · web to markdown · ai workflow · rag · chinese social media · anti-bot

How to Convert Xiaohongshu (RED / 小红书) Posts to Markdown — and Feed Them to Claude or ChatGPT

Zephyr Whimsy · 2026-05-07 · 7 min read

If you have ever tried to scrape a Xiaohongshu (also known as RED, or 小红书) post for an AI workflow, you already know the wall. You write a Python script with requests, you point it at https://www.xiaohongshu.com/explore/<noteId>, and you get back either a 403, a JavaScript shell with no content, or a captcha challenge. Switch to Selenium with a headless Chrome — same wall, slower. Try a paid service like Firecrawl or Apify — same wall, more expensive.

The problem is not your code. Xiaohongshu is one of the hardest social platforms in the world to scrape from outside, and the difficulty is intentional. The request signing is rotated regularly, often enough that working signers break within months. The HTML you see in the browser is hydrated client-side from a signed API. The cookies you need are bound to your IP and User-Agent.

There is exactly one reliable way to extract Xiaohongshu content for AI workflows in 2026: stop trying to scrape it. Read the page from the browser that already loaded it.

Why server-side scraping fails on Xiaohongshu

Three layers of protection make pure HTTP scraping unworkable:

  1. Request signing. Every API call carries x-s and x-t headers that are a function of the request body, the cookie, the timestamp, and a rotating server-side key. Reverse-engineering this signing has become a small cottage industry: dozens of GitHub projects maintain working signers, and they all break within a few months when Xiaohongshu rotates the algorithm.

  2. Anti-bot fingerprinting. Xiaohongshu profiles your TLS fingerprint, your TCP behavior, your User-Agent consistency, and your IP reputation. Standard requests, httpx, and even most headless Chromium configurations get flagged immediately.

  3. Hydration trap. Even when you bypass 1 and 2, the page HTML on first load is mostly empty. The actual post content lives in window.__INITIAL_STATE__, populated by JavaScript after a series of authenticated XHR calls. You need a real JavaScript runtime to wait for hydration to complete.

The 2026 dev.to post "How to scrape RedNote (Xiaohongshu) with Python in 2026 — the auth/signing problem and how to solve it" is the most-shared writeup of this exact problem, and the conclusion is the same: pure server-side scraping is a treadmill.
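
To make the request-signing layer concrete, here is a toy illustration in Python. It is not Xiaohongshu's algorithm, which is not public; the HMAC construction, the function names, and the key values are all assumptions, chosen only to show that a signature depending on the body, the cookie, the timestamp, and a rotating key stops validating the moment the key changes.

```python
import hashlib
import hmac


def sign_request(body: str, cookie: str, timestamp: int, key: bytes) -> str:
    """Toy signature over body, cookie, and timestamp (hypothetical scheme)."""
    message = f"{body}|{cookie}|{timestamp}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def server_accepts(signature: str, body: str, cookie: str, timestamp: int, current_key: bytes) -> bool:
    """The server recomputes the signature with its current key and compares."""
    expected = sign_request(body, cookie, timestamp, current_key)
    return hmac.compare_digest(signature, expected)


# Made-up values for illustration only.
old_key = b"key-v1"
new_key = b"key-v2"
body, cookie, ts = '{"note_id": "abc"}', "session=xyz", 1700000000

sig = sign_request(body, cookie, ts, old_key)
accepted_before_rotation = server_accepts(sig, body, cookie, ts, old_key)
accepted_after_rotation = server_accepts(sig, body, cookie, ts, new_key)
```

A signer that was correct yesterday produces a value the server rejects today, which is why third-party signers need constant maintenance.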

The trick: extract from the browser you already trust

Here is the inversion: when you, a human, click a Xiaohongshu link in your browser, the post loads cleanly. Your cookies are valid, your fingerprint is real, your TLS handshake passes, and the JavaScript runs to completion. The window.__INITIAL_STATE__ object is fully populated with the structured note data — title, description, images, tags, author, IP location, engagement counts.

The data is already in your browser. The only question is how to get it out as Markdown.

A browser extension is the right tool here because it has access to two things a server cannot:

  • The fully rendered DOM and the JavaScript runtime state
  • Your existing cookies and session

Web2MD is a free Chrome extension that does exactly this for Xiaohongshu and a handful of other Chinese platforms (WeChat 公众号, Zhihu, Bilibili). The flow:

  1. Open a Xiaohongshu post in Chrome (signed in or not; both work).
  2. Press Ctrl+M (or Alt+M on Windows) or click the Web2MD icon.
  3. The extension reads window.__INITIAL_STATE__, walks the state tree to find the note object, and extracts the structured fields.
  4. The output is clean Markdown with title, body, author with IP location, tags, images, and engagement stats, ready to paste into Claude, ChatGPT, or NotebookLM, or to save to Obsidian.
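
Step 3 can be sketched in Python, operating on the state after it has been serialized to a plain dictionary. The key names (`title`, `desc`, `imageList`) and the nesting of the sample tree are assumptions for illustration; the real shape of window.__INITIAL_STATE__ is not documented here and may differ.

```python
from typing import Any, Optional

# Hypothetical keys that identify a note object; the real state may use other names.
REQUIRED_KEYS = {"title", "desc"}


def find_note(node: Any) -> Optional[dict]:
    """Depth-first search for the first dict that carries all REQUIRED_KEYS."""
    if isinstance(node, dict):
        if REQUIRED_KEYS.issubset(node.keys()):
            return node
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None  # scalars cannot contain a note
    for child in children:
        found = find_note(child)
        if found is not None:
            return found
    return None


# Made-up state tree, nested the way a hydrated store might be.
state = {
    "user": {"loggedIn": False},
    "note": {
        "noteDetailMap": {
            "abc123": {"note": {"title": "Coffee map", "desc": "Eight cafes", "imageList": []}}
        }
    },
}

note = find_note(state)
missing = find_note({"user": {"loggedIn": True}})
```

Searching by shape rather than by a fixed path is one way to tolerate small changes in where the note sits inside the tree.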

Extraction falls back in three tiers. If the state tree is missing, the extension tries DOM extraction from #noteContainer, and if that also fails it falls back to Open Graph meta tags, so something comes through even in the worst case.
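
The fallback order described above amounts to trying extractors in sequence and taking the first one that yields something. A minimal Python sketch follows; the three extractor functions and the `page` dictionary are hypothetical stand-ins that operate on already-captured inputs, not the extension's actual code.

```python
from typing import Optional, Tuple


def from_state(page: dict) -> Optional[dict]:
    """Tier 1: structured note from the hydrated state, if present."""
    return page.get("state_note")


def from_dom(page: dict) -> Optional[dict]:
    """Tier 2: text captured from the #noteContainer element, if present."""
    text = page.get("note_container_text")
    return {"title": None, "body": text} if text else None


def from_open_graph(page: dict) -> Optional[dict]:
    """Tier 3: Open Graph meta tags such as og:title and og:description."""
    og = page.get("og") or {}
    if not og:
        return None
    return {"title": og.get("og:title"), "body": og.get("og:description")}


def extract(page: dict) -> Tuple[str, Optional[dict]]:
    """Return (tier_name, note) from the first extractor that succeeds."""
    tiers = [("state", from_state), ("dom", from_dom), ("open_graph", from_open_graph)]
    for name, extractor in tiers:
        note = extractor(page)
        if note is not None:
            return name, note
    return "none", None


# A page where hydration failed but the meta tags are still present.
page = {
    "state_note": None,
    "note_container_text": "",
    "og": {"og:title": "Coffee map", "og:description": "Eight cafes"},
}
tier, note = extract(page)
```

Returning the tier name alongside the note makes it easy to tell downstream how complete the extraction is likely to be.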

What the output actually looks like

A typical Xiaohongshu lifestyle post extracted via Web2MD:

# 周末城市漫步|上海徐汇咖啡地图 8 家

**周末小确幸 (@sweetsuwu)** · 上海 · 2026-05-04

## Body

挑了 8 家最近反复回购的咖啡馆,每一家都有独特的灵魂…
(full body text, with paragraphs and line breaks preserved)

## Images

![image](https://sns-img-bd.xhscdn.com/...)
![image](https://sns-img-bd.xhscdn.com/...)

Tags: #咖啡 #上海 #生活方式 #周末 #探店

❤ 12.3k · ⭐ 8.4k · 💬 421 · ↗ 89

This is dramatically more useful than the alternatives. A requests-based scraper that gets through the wall returns raw JSON that you would need to format yourself. Firecrawl returns little usable text because it cannot see the hydrated content. A copy-paste from the browser drops the engagement metrics and the IP location and breaks the image links.
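
For readers curious how structured fields map onto the layout shown above, here is a rough Python sketch of a renderer. The field names in the `note` dictionary and the `format_count` and `render_note` helpers are assumptions for illustration, not Web2MD's implementation.

```python
def format_count(n: int) -> str:
    """Abbreviate counts of 1000 or more with one decimal and a 'k' suffix."""
    if n < 1000:
        return str(n)
    return f"{n / 1000:.1f}k"


def render_note(note: dict) -> str:
    """Render a note dictionary as Markdown in the sample layout."""
    lines = [
        f"# {note['title']}",
        "",
        f"**{note['author']} (@{note['handle']})** · {note['ip_location']} · {note['date']}",
        "",
        "## Body",
        "",
        note["body"],
        "",
        "## Images",
        "",
    ]
    lines += [f"![image]({url})" for url in note["images"]]
    lines += ["", "Tags: " + " ".join(f"#{tag}" for tag in note["tags"]), ""]
    lines.append(
        f"❤ {format_count(note['likes'])} · ⭐ {format_count(note['collects'])}"
        f" · 💬 {format_count(note['comments'])} · ↗ {format_count(note['shares'])}"
    )
    return "\n".join(lines)


# Made-up note with hypothetical field names.
note = {
    "title": "Coffee map",
    "author": "Weekend Notes",
    "handle": "example_user",
    "ip_location": "Shanghai",
    "date": "2026-05-04",
    "body": "Eight cafes worth a repeat visit.",
    "images": ["https://example.com/1.jpg", "https://example.com/2.jpg"],
    "tags": ["coffee", "shanghai"],
    "likes": 12300,
    "collects": 8400,
    "comments": 421,
    "shares": 89,
}
markdown = render_note(note)
```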

Real use cases this unlocks

RAG pipelines for Chinese consumer research. If you're building a market-research agent that needs to summarize Xiaohongshu trends, getting the source text used to be the blocker. Now you can feed 50 posts into Claude and ask "what are the recurring themes in Shanghai coffee shop reviews this month?"
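
One way to assemble such a request is to concatenate the converted posts under a size budget and append the question. The sketch below is plain Python with no API call; `build_prompt`, the separator format, and the budget value are assumptions, and a real pipeline would likely count tokens rather than characters.

```python
def build_prompt(posts: list, question: str, max_chars: int = 20000) -> str:
    """Join Markdown posts until the budget is reached, then append the question."""
    included = []
    used = 0
    for index, post in enumerate(posts, start=1):
        block = f"--- Post {index} ---\n{post.strip()}\n"
        if used + len(block) > max_chars:
            break  # stop at the first post that would exceed the budget
        included.append(block)
        used += len(block)
    header = f"Xiaohongshu posts included: {len(included)}\n\n"
    return header + "\n".join(included) + f"\nQuestion: {question}\n"


posts = ["# Cafe A\n\nQuiet, good pour-over.", "# Cafe B\n\nCrowded on weekends."]
question = "What are the recurring themes?"

prompt = build_prompt(posts, question)
tight = build_prompt(posts, question, max_chars=60)  # only the first post fits
```

The resulting string can be pasted into a chat window or passed as the user message to whichever model client you already use.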

Save to Obsidian as a personal knowledge base. Travel research, restaurant recommendations, beauty product reviews — Xiaohongshu has high-signal long-form content that gets lost the moment you scroll past it.

Translation and summarization workflows. Pull a Chinese-language post into Markdown, send it to Claude with a "summarize in English" prompt template, and get back a clean translated summary.

Competitive intelligence. Brands monitor Xiaohongshu for product mentions and consumer sentiment. The structured engagement metrics in the extraction make it tractable to track which posts about your brand are gaining traction.
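
Tracking traction means turning the abbreviated stats line back into numbers. Here is a small Python sketch, assuming the line format from the sample output earlier, that the four icons stand for likes, saves, comments, and shares, and that a `k` suffix means thousands. The function names and the sample posts are hypothetical.

```python
import re
from decimal import Decimal

# Icons written as escapes so the mapping does not depend on emoji variation selectors.
ICONS = {"\u2764": "likes", "\u2b50": "saves", "\U0001f4ac": "comments", "\u2197": "shares"}


def parse_count(text: str) -> int:
    """Convert '421' or '12.3k' to an integer; Decimal avoids float rounding."""
    text = text.strip().lower()
    if text.endswith("k"):
        return int(Decimal(text[:-1]) * 1000)
    return int(text)


def parse_stats(line: str) -> dict:
    """Parse icon/value pairs from an engagement line into a dict of ints."""
    stats = {}
    for icon, value in re.findall(r"(\S+)\s+(\d+(?:\.\d+)?[kK]?)", line):
        name = ICONS.get(icon.replace("\ufe0f", ""))
        if name is not None:
            stats[name] = parse_count(value)
    return stats


def rank_by(posts: dict, metric: str) -> list:
    """Return post titles sorted by the given metric, highest first."""
    return sorted(posts, key=lambda title: parse_stats(posts[title]).get(metric, 0), reverse=True)


# Made-up posts keyed by title, each with its extracted stats line.
posts = {
    "Post A": "\u2764 12.3k · \u2b50 8.4k · \U0001f4ac 421 · \u2197 89",
    "Post B": "\u2764 950 · \u2b50 1.2k · \U0001f4ac 1.1k · \u2197 12",
}
stats = parse_stats(posts["Post A"])
most_discussed = rank_by(posts, "comments")
```

Abbreviated counts lose precision (12.3k could be anything from 12,250 to 12,349), so rankings built this way are approximate.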

What about the alternatives

There are a few specialized Xiaohongshu tools worth knowing about:

  • XHS-Downloader (JoeanAmier/XHS-Downloader) is excellent for downloading the images and videos from a post. It does not extract clean Markdown text — that's not its goal. If your workflow is "download the image carousel," it's the right tool.
  • xiaohongshu-mcp projects are starting to appear on GitHub. Most of them are early-stage and depend on the same fragile signing reverse-engineering as pure scraping. They tend to break when Xiaohongshu rotates auth.
  • Manual copy-paste loses formatting, drops images, drops engagement metrics, and is intolerable beyond a handful of posts.

For text-first AI workflows — the kind where you want to feed structured post content into Claude, ChatGPT, NotebookLM, or your own embedding store — a browser extension that reads the hydrated state is the most reliable approach available in 2026.

Other Chinese platforms with the same problem

The same pattern applies to every major Chinese social platform that blocks Western scrapers:

  • WeChat Official Account articles (mp.weixin.qq.com/s/<id>): Tencent blocks external requests. The article body lives in #js_content with lazy-loaded images. Web2MD has a dedicated extractor.
  • Zhihu (zhihu.com/question/.../answer/... and zhuanlan.zhihu.com/p/<id>) — supports column articles, single answers, and full questions with the top 5 answers sorted by votes.
  • Bilibili videos and column articles: extracts title, UP主 (uploader), description, tags, and engagement stats from __INITIAL_STATE__.videoData.
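
A tool that covers several platforms needs some way to pick the right extractor for a URL. A minimal Python sketch of that dispatch, using the URL patterns listed above, is shown here. The `detect_platform` function and its labels are hypothetical, and the Bilibili rule matches on host only because no Bilibili URL pattern is given in this article.

```python
from urllib.parse import urlparse


def on_domain(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def detect_platform(url: str) -> str:
    """Map a URL to a platform label based on host and path."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path
    if on_domain(host, "xiaohongshu.com") and path.startswith("/explore/"):
        return "xiaohongshu"
    if host == "mp.weixin.qq.com" and path.startswith("/s/"):
        return "wechat"
    if host == "zhuanlan.zhihu.com" and path.startswith("/p/"):
        return "zhihu_column"
    if on_domain(host, "zhihu.com") and path.startswith("/question/"):
        return "zhihu_answer" if "/answer/" in path else "zhihu_question"
    if on_domain(host, "bilibili.com"):  # assumption: host-only match
        return "bilibili"
    return "unknown"


platform = detect_platform("https://www.xiaohongshu.com/explore/abc123")
```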

If you work with Chinese content for AI workflows, having a single tool that handles all four is a significant time saver.

How to try it

Web2MD is on the Chrome Web Store. The free tier converts up to 3 pages a day, with no signup required. Pro is $9/month for unlimited conversions plus batch conversion (up to 50 URLs at once via the MCP server, useful for RAG ingestion runs).

The Xiaohongshu extractor is in the free tier — you don't need Pro for this specific feature.

If you're building an AI agent or a research workflow that needs Chinese social content, the simplest test is to install the extension, open any Xiaohongshu post you actually care about, and press Ctrl+M. The output should land in your clipboard ready to paste into Claude.

