# Convert WeChat Official Account (公众号) Articles to Markdown for AI Workflows
WeChat Official Account articles — 微信公众号文章 — are one of the most underused content sources for AI workflows. Industry analysis, expert deep-dives, founder essays, technical walkthroughs, brand storytelling. The signal-to-noise ratio on quality 公众号 content is often higher than equivalent Western newsletters, partly because the platform doesn't favor short-form posts and partly because the content tends to be paid or community-curated.
And yet, almost no one outside China feeds 公众号 content into Claude, ChatGPT, or NotebookLM. The reason is simple: you can't scrape it.
This article explains exactly why 公众号 articles resist server-side scraping, and what the reliable extraction path looks like in 2026.
## Why server-side scraping doesn't work
Tencent treats 公众号 articles (`mp.weixin.qq.com/s/<id>` URLs) as content that lives inside the WeChat ecosystem. The web view at `mp.weixin.qq.com` exists for sharing, not for syndication.
Three layers of defense:
1. Aggressive User-Agent and IP filtering. Standard `requests` and `httpx` clients get back a stub page with no article content. Even sophisticated headless Chromium configurations get flagged by Tencent's TLS fingerprinting unless you mimic real Chrome carefully.
2. Article-load token validation. The article URL contains short-lived tokens (`__biz`, `mid`, `idx`, `sn`, `chksm`) that the server validates on every request. Tokens issued for a user in the WeChat app sometimes work in a browser; tokens scraped from a public link sometimes don't. The behavior is intentionally inconsistent.
3. Lazy-loaded images. Even when you successfully load the article HTML, every image uses `data-src` instead of `src`, and the actual image loads only when the user scrolls. A naive scraper that grabs the HTML on first load gets text but no images.
The community workarounds — wechatmp2markdown, wechat-article-extractor, various Python libraries — work intermittently. They tend to break when Tencent rotates anything, which is often.
## The browser approach
When you, a real person, open a 公众号 article URL in Chrome, the article loads cleanly because:
- Your browser's User-Agent and TLS handshake look real
- The page hydrates in a real JavaScript runtime
- Lazy-loaded images fire as the page renders
- You're a single-user request, not a scraping pattern
Once the page is rendered in your tab, the article body lives in `#js_content`, a single, well-defined DOM container. The author info is in `#js_name`. The publish time is in `#publish_time`. The IP location is in `#js_ip_wording`.
A browser extension that runs inside your tab can read all of this. No anti-bot challenge, no token validation, no scraping treadmill — because there's no scraping happening. The extension is reading the page you're already looking at.
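In sketch form, a content script's read of those containers is a handful of `querySelector` calls. The `extractArticle` helper below and its return shape are illustrative, not Web2MD's actual code; only the selector names come from the article above:

```javascript
// Minimal content-script sketch: read a rendered 公众号 article
// straight out of the live DOM. `root` is the page's `document`.
function extractArticle(root) {
  const text = (selector) => {
    const el = root.querySelector(selector);
    return el ? el.textContent.trim() : null;
  };
  return {
    author: text('#js_name'),           // account / author name
    published: text('#publish_time'),   // publish date
    ipLocation: text('#js_ip_wording'), // IP-based location label
    bodyHtml: (root.querySelector('#js_content') || {}).innerHTML || null,
  };
}
```

Inside an extension you would call `extractArticle(document)` and hand `bodyHtml` to an HTML-to-Markdown converter.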
## What clean extraction looks like
A typical 公众号 article extracted via Web2MD ends up looking like this:
````markdown
# 创业者必读:从 0 到 1 的真实路径
(Founder must-read: the real path from 0 to 1)

**某某创业者** · 2026-04-12 · 上海

文章正文第一段,保留原始段落结构和换行...

文章正文第二段...

> 引用块保留为 blockquote
>
> 包括嵌套引用

## 二级标题

代码块也保留:

```python
def example():
    return "正常显示"
```

文章末尾的图片:

![](https://mmbiz.qpic.cn/...)

LaTeX 公式:

$$ E = mc^2 $$
````
The lazy-loaded images get hydrated automatically (the extension copies `data-src` into `src` so the converter sees the real URL). The QR code at the end of every 公众号 article gets stripped as noise. The "view in app" prompt and reward-tip iframe get removed.
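That hydration step can be sketched as a plain string transform over the article HTML. The function name is illustrative; in a real extension you would mutate the live DOM instead (roughly `img.src = img.dataset.src` over `img[data-src]` elements):

```javascript
// Sketch of the lazy-image hydration step: promote data-src to src so a
// downstream HTML-to-Markdown converter sees the real image URLs.
function hydrateLazyImages(html) {
  return html.replace(/<img\b[^>]*>/g, (tag) =>
    tag.includes('data-src')
      ? tag
          .replace(/\ssrc="[^"]*"/, '')    // drop the placeholder src, if any
          .replace(/\bdata-src=/, 'src=')  // promote the real URL
      : tag
  );
}
```

The leading whitespace in `/\ssrc="[^"]*"/` matters: a bare `src=` pattern would also match inside `data-src=` and corrupt the attribute.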
## Why this matters for AI workflows
Four concrete use cases:
### Industry research
If you're tracking a specific industry in China — fintech, e-commerce, SaaS, manufacturing — the best primary-source content is often in 公众号 articles by working practitioners. Feeding 30 articles into Claude with a prompt like "What are the recurring concerns about $X in this batch?" produces a synthesis that no equivalent English-language source can give you.
### Translation pipelines
Pull a 公众号 article into Markdown, send it to Claude with a "translate to English, preserving technical terminology" prompt template, get back a clean translated draft. The Markdown structure survives the translation — headings stay headings, code blocks stay code blocks.
### RAG over Chinese knowledge
If you're building RAG over Chinese content for a research agent, 公众号 articles are denser per token than the social platforms. A few hundred well-chosen articles can ground an agent in a specific domain better than a million Weibo posts.
### Personal knowledge base
If you read 公众号 articles via WeChat on your phone, you've probably hit the moment of "I want to refer to this in 6 months but I'll never find it again." The fix is to clip the article to Obsidian or Notion when you read it — Web2MD does this directly, with `obsidian://` URI handoff for the vault you specify.
## How to use Web2MD for 公众号
1. Install the [Web2MD Chrome extension](https://chromewebstore.google.com/detail/web2md-web-to-markdown/ijmgpkkfgpijifldbjafjiapehppcbcn).
2. Open any `mp.weixin.qq.com/s/...` URL in Chrome.
3. Press `Ctrl+M` (or `Alt+M` on Windows) — or click the Web2MD icon and hit Convert.
4. The Markdown lands in your clipboard, ready to paste into Claude, ChatGPT, Obsidian, or Notion.
The 公众号 extractor is in the free tier — no Pro subscription needed for this specific feature. Free tier converts up to 3 articles per day. Pro at $9/month removes the daily limit and adds batch convert (useful if you're doing research-style ingestion of many articles in a session).
## Other Chinese platforms
The same browser-based approach is the only reliable path for the other major Chinese platforms that block server-side scraping:
- **Xiaohongshu (小红书 / RED)** — `www.xiaohongshu.com/explore/<noteId>` — extracts title, body, author with IP location, tags, images, and engagement metrics from `__INITIAL_STATE__`. Covered in detail in [How to convert Xiaohongshu posts to Markdown](/blog/xiaohongshu-to-markdown-2026).
- **Zhihu** (`zhihu.com/question/.../answer/...` and `zhuanlan.zhihu.com/p/...`) — supports column articles, single answers, and full questions with the top 5 answers sorted by votes.
- **Bilibili** (`bilibili.com/video/...` and `bilibili.com/read/cv...`) — for videos, extracts title, UP主 (uploader), description, tags, and engagement stats. For column articles, extracts the full article body.
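For Xiaohongshu specifically, the `__INITIAL_STATE__` mentioned above is a state object embedded in a `<script>` tag of the rendered page. A rough sketch of pulling it out of the page HTML (illustrative only; real pages need more normalization than shown, e.g. the blob can contain bare `undefined` values, which this naive rewrite turns into `null` and which would also corrupt a literal `"undefined"` string):

```javascript
// Sketch: extract and parse window.__INITIAL_STATE__ from page HTML.
function parseInitialState(html) {
  const match = html.match(
    /window\.__INITIAL_STATE__\s*=\s*(\{[\s\S]*?\})\s*<\/script>/
  );
  if (!match) return null;
  // Naive cleanup: JSON.parse rejects the bare `undefined` values
  // Xiaohongshu embeds, so rewrite them to null first.
  const json = match[1].replace(/\bundefined\b/g, 'null');
  return JSON.parse(json);
}
```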
If you regularly work with Chinese content for AI workflows, having a single tool that handles all four platforms saves significant ongoing maintenance versus stitching together four separate scrapers that break every few months.
## What to use, and what not to use
**Use Web2MD when:**
- You read 公众号 articles in your browser anyway and want to clip them into your AI workflow
- You need clean Markdown with images, formatting, and metadata preserved
- You want to combine this with Xiaohongshu / Zhihu / Bilibili extraction in a single tool
**Use `wechatmp2markdown` (the GitHub project) when:**
- You're a developer building a custom pipeline and you want full open-source control
- You're willing to maintain the extractor against Tencent's rotating defenses
- Browser-based extraction doesn't fit your environment
**Don't use:**
- Generic Readability extensions — they don't handle the lazy-load image pattern correctly
- Server-side scraping libraries — Tencent's defenses make this a treadmill
The browser-based approach isn't the most elegant architecturally, but it's the only one that works reliably across every release of WeChat's web view in 2026.
---
**Related**:
- [How to convert Xiaohongshu (RED) posts to Markdown](/blog/xiaohongshu-to-markdown-2026)
- [Best web clipper in 2026](/blog/web-clipper-comparison-2026-after-markdownload-pocket)
- [Firecrawl alternative that uses your real browser](/blog/firecrawl-alternative-browser-rag-2026)
- [How to feed webpage content to ChatGPT and Claude](/blog/how-to-feed-webpage-content-to-chatgpt-claude)