Web to Markdown RAG Pipeline: Clean Chunks

Zephyr Whimsy | 2026-06-21 | 8 min read

The cleanest pipeline I would use for ingesting web content into a RAG vector database is this:

URL or browser source list → capture the page → convert the page to clean Markdown → normalize the Markdown → split by Markdown structure → attach metadata → embed chunks → upsert into Qdrant, Pinecone, Weaviate, Chroma, or pgvector → keep the raw Markdown and chunk manifest for debugging

That is the boring answer, but boring is good here. Most bad RAG pipelines fail before embeddings ever enter the picture. They feed the vector database junk HTML, repeated nav links, cookie banners, invisible UI text, broken tables, or chunks split halfway through a heading. Then everyone blames retrieval.

Markdown is the right middle layer because it preserves enough document structure for AI tools while stripping away most of the browser noise. I covered this more directly in /blog/markdown-vs-html-for-llm and /blog/rag-pipeline-web-data-preprocessing, but the short version is simple: Markdown gives your chunker something meaningful to work with.

## The practical workflow

For a production-ish RAG workflow, I would split the job into two modes.

Automated ingestion:

  1. Start with a URL list or sitemap.
  2. Use Firecrawl, Jina Reader, or another crawler to fetch pages.
  3. Convert each page to Markdown.
  4. Normalize headings, tables, links, and code blocks.
  5. Split by headings first, then token length.
  6. Store raw Markdown, chunks, URL, title, timestamp, and source metadata.
  7. Embed and upsert.
  8. Re-run only changed pages.
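Step 8 is the one people skip. Here is a minimal sketch of change detection, assuming you keep a URL-to-hash map from the previous run. The function names and the shape of `seen_hashes` are my own illustration, not part of any crawler's API.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Return a SHA-256 hex digest of the Markdown, ignoring trailing whitespace."""
    # Strip trailing whitespace per line so cosmetic diffs do not trigger re-embedding.
    normalized = "\n".join(line.rstrip() for line in markdown.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pages_to_reprocess(fetched: dict[str, str], seen_hashes: dict[str, str]) -> list[str]:
    """Return the URLs whose Markdown is new or differs from the stored hash."""
    changed = []
    for url, markdown in fetched.items():
        if seen_hashes.get(url) != content_hash(markdown):
            changed.append(url)
    return sorted(changed)
```

Hashing the converted Markdown rather than the raw HTML is a deliberate choice in this sketch: rotating ad slots and session tokens change the HTML on every fetch, but they rarely survive conversion.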

Human-curated ingestion:

  1. Open the source in Chrome.
  2. Make sure the page is in the state you actually want, such as logged in, expanded, filtered, translated, or scrolled.
  3. Use Web2MD to convert the visible/source page to Markdown.
  4. Review the Markdown before embedding.
  5. Save the Markdown file with metadata.
  6. Chunk, embed, and upsert.

That second mode is where Web2MD belongs. It is not a crawler. It is not trying to be an enterprise document ETL system. It is a browser-based Markdown capture tool, and that matters in scenarios where the browser sees the content better than a server-side fetcher does.
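For the "save the Markdown file with metadata" step, one option is a small front matter header on each file. This is a sketch under my own assumptions: the field names (`source_url`, `title`, `converter`, `captured_at`) and the `"web2md"` label are illustrative, and nothing here is a format that Web2MD itself emits.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def with_front_matter(markdown: str, source_url: str, title: str, converter: str,
                      captured_at: Optional[datetime] = None) -> str:
    """Prefix captured Markdown with a metadata header the indexing script can parse."""
    captured_at = captured_at or datetime.now(timezone.utc)
    meta = {
        "source_url": source_url,
        "title": title,
        "converter": converter,
        "captured_at": captured_at.isoformat(),
    }
    lines = ["---"]
    # JSON-quote every value so colons in URLs and titles need no special handling.
    lines += [f"{key}: {json.dumps(value)}" for key, value in meta.items()]
    lines.append("---")
    return "\n".join(lines) + "\n\n" + markdown.strip() + "\n"
```

Usage would look like `with_front_matter(captured_text, "https://example.com/docs/billing/refunds", "Refund policy", "web2md")`, with the result written to a `.md` file.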

First, the stack.

If I were building the clean default today, I would use:

  • Capture/crawl: Firecrawl for site crawling, Jina Reader for simple URL conversion, Web2MD for browser-captured pages
  • File conversion: MarkItDown or Unstructured for PDFs, DOCX, PPTX, and mixed office documents
  • Markdown normalization: your own small script, because every corpus has weird edges
  • Chunking: LlamaIndex MarkdownNodeParser or LangChain MarkdownHeaderTextSplitter
  • Embeddings: OpenAI text-embedding-3-small/large, Voyage, BGE, or Cohere Embed
  • Vector database: Qdrant for a clean self-hosted default, Pinecone for managed simplicity, pgvector if the rest of your app already lives in Postgres
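The "own small script" for normalization can start very small. The sketch below is one possible baseline, not a complete normalizer: it converts Windows line endings, trims trailing whitespace, collapses runs of blank lines, and rewrites `•` bullets as `-`, while leaving fenced code untouched. It assumes fences are never nested, which will not hold for every corpus.

```python
import re

# Fence markers are built with string multiplication so this sample can itself
# sit inside a fenced block.
FENCE_MARKERS = ("`" * 3, "~" * 3)
BULLET = re.compile(r"^(\s*)[•·]\s+")


def normalize_markdown(text: str) -> str:
    """Apply conservative whitespace and bullet cleanup outside fenced code blocks."""
    out: list[str] = []
    in_fence = False
    blank_run = 0
    for line in text.replace("\r\n", "\n").split("\n"):
        if line.lstrip().startswith(FENCE_MARKERS):
            # Assumption: every fence line toggles state (no nested fences).
            in_fence = not in_fence
            blank_run = 0
            out.append(line.rstrip())
            continue
        if in_fence:
            out.append(line)  # never rewrite code
            continue
        line = BULLET.sub(r"\1- ", line.rstrip())
        if not line:
            blank_run += 1
            if blank_run > 1:
                continue  # keep at most one consecutive blank line
        else:
            blank_run = 0
        out.append(line)
    return "\n".join(out).strip() + "\n"
```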

A chunk should usually keep its heading path. For example, instead of storing this as anonymous text:

# Refund policy

## Enterprise plans

Enterprise customers can request a refund within 30 days if the workspace has fewer than 10 active users.

Refund requests must include:
- workspace ID
- billing email
- reason for cancellation

I would store something closer to:

source_url: https://example.com/docs/billing/refunds
title: Refund policy
heading_path: Refund policy > Enterprise plans

## Enterprise plans

Enterprise customers can request a refund within 30 days if the workspace has fewer than 10 active users.

Refund requests must include:
- workspace ID
- billing email
- reason for cancellation

That extra heading path looks small, but it makes retrieval much easier to debug. When the model cites a chunk, you can see where it came from and why it matched.
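In code, the stored record above might look like the sketch below. The `Chunk` class, its field names, and the `part` counter are hypothetical. The deterministic ID is my own addition: deriving it from the URL and heading path means re-ingesting the same section overwrites the old vector on upsert instead of duplicating it.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source_url: str
    title: str
    heading_path: tuple[str, ...]
    body: str
    part: int = 0  # index within the section when a long section is split further

    @property
    def chunk_id(self) -> str:
        """Deterministic ID built from location, not content, so upserts overwrite."""
        key = f"{self.source_url}#{' > '.join(self.heading_path)}#{self.part}"
        return str(uuid.uuid5(uuid.NAMESPACE_URL, key))

    def embedding_text(self) -> str:
        """Text to embed: the metadata header followed by the section body."""
        header = [
            f"source_url: {self.source_url}",
            f"title: {self.title}",
            f"heading_path: {' > '.join(self.heading_path)}",
        ]
        return "\n".join(header) + "\n\n" + self.body.strip()
```

Whether the header belongs in the embedded text or only in the metadata payload is worth testing on your own corpus. I include it here because that is the shape shown above.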

## How Web2MD compares to the usual tools

Firecrawl is strong when you need crawling. If you want to ingest an entire docs site, discover links, handle sitemaps, and run the job repeatedly, Firecrawl is the obvious first thing to test. It can return Markdown directly, and it handles many messy websites better than a quick script. I would still keep a sample set and inspect its Markdown before trusting a full crawl.

Jina Reader is excellent for lightweight URL-to-Markdown conversion. The r.jina.ai/http://... pattern is hard to beat for quick experiments. If you want to test a RAG prototype with 20 public pages, it is fast and low-friction. The tradeoff is control. It is less suited to authenticated pages, browser-only state, and pages where you need to decide exactly what content is worth keeping.
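For a quick experiment, the prefix pattern can be wrapped in a few lines of standard-library Python. Treat this as a sketch: it assumes the reader endpoint returns the converted page as plain text in the response body, the `User-Agent` string is a placeholder I made up, and rate limits or API keys are not handled.

```python
from urllib.request import Request, urlopen

READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url: str) -> str:
    """Build the reader URL by prefixing the full page URL, scheme included."""
    if not page_url.startswith(("http://", "https://")):
        raise ValueError(f"expected an absolute http(s) URL, got {page_url!r}")
    return READER_PREFIX + page_url


def fetch_markdown(page_url: str, timeout: float = 30.0) -> str:
    """Fetch a public page through the reader endpoint and return the response text."""
    request = Request(reader_url(page_url), headers={"User-Agent": "rag-prototype/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```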

MarkItDown is useful when your corpus is not just websites. PDFs, Word documents, PowerPoint files, spreadsheets, and HTML dumps all show up in real knowledge bases. I like MarkItDown for developer-friendly local conversion. It is not a crawler, and it is not meant to solve browser capture.

Unstructured is the heavyweight option. It is good when layout, document types, OCR, and metadata matter. If you are ingesting scanned PDFs, contracts, tables, and enterprise document archives, Unstructured deserves a look. It can also be more setup than you need for a simple web-to-RAG workflow.

Web2MD wins in a different lane.

## Where Web2MD genuinely wins

Web2MD is best when the source is a webpage that a human has already found, opened, and judged worth saving.

That sounds modest, but it covers a lot of real RAG work:

  • You are collecting high-quality sources for a small knowledge base.
  • The page is behind login or requires an active browser session.
  • The page changes after filters, tabs, accordions, or "show more" buttons.
  • You want to capture the page after accepting cookies or changing language.
  • You want Markdown you can paste into ChatGPT, Claude, Cursor, or a local indexing script immediately.
  • You are debugging chunk quality before automating the ingestion path.
  • You do not want to run a crawler for a 30-page hobby RAG project.

A lot of teams pretend ingestion is fully automated from day one. In practice, someone spends hours checking sources, copying excerpts, comparing outputs, and deciding which converter produced the cleanest Markdown. Web2MD fits that human-in-the-loop stage well.

Here is the kind of Markdown shape I want before chunking:

# Webhook retries

Webhooks are retried for up to 24 hours when the endpoint returns a non-2xx response.

## Retry schedule

| Attempt | Delay |
| --- | --- |
| 1 | 1 minute |
| 2 | 5 minutes |
| 3 | 30 minutes |
| 4 | 2 hours |

## Signature header

Each webhook includes an `X-Signature` header.

```js
const expected = createHmac("sha256", secret)
  .update(rawBody)
  .digest("hex");
```

That is chunkable. The heading hierarchy is intact. The table is still a table. The code block is not flattened into a paragraph. If your converter turns that into a blob of text, your chunker has to guess.
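A cheap way to catch a converter that flattens structure is to count what survived. The sketch below is a rough sanity check, not a Markdown parser: `structure_report` and its keys are hypothetical names, and the table test is a simple "line starts and ends with a pipe" heuristic.

```python
import re

# Built with string multiplication so this sample can itself sit inside a fenced block.
FENCE = "`" * 3


def structure_report(markdown: str) -> dict[str, int]:
    """Count headings, table rows, and fenced code blocks, and flag an unclosed fence."""
    report = {"headings": 0, "table_rows": 0, "code_blocks": 0, "unclosed_fences": 0}
    in_fence = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if not in_fence:
                report["code_blocks"] += 1
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # a '#' inside code is a comment, not a heading
        if re.match(r"#{1,6} \S", stripped):
            report["headings"] += 1
        elif stripped.startswith("|") and stripped.endswith("|"):
            report["table_rows"] += 1
    report["unclosed_fences"] = int(in_fence)
    return report
```

If a page that visibly has a table and a code sample comes back with zero table rows and zero code blocks, I would fix the conversion before touching the chunker.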

For more on this browser-first workflow, see /blog/chrome-extension-webpage-to-markdown-ai-2026 and /blog/jina-reader-vs-firecrawl-vs-web2md-honest-test-2026.

## The chunking rule I trust

Do not split web content by raw character count first. Split by Markdown structure first.

A good simple rule:

1. Split at `#` and `##` headings.
2. Keep smaller `###` sections together when possible.
3. Preserve tables and code blocks as indivisible blocks.
4. Add overlap only inside long prose sections.
5. Store the source URL, title, heading path, capture time, and converter name.
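Rules 1 to 3 can be sketched in plain Python. This is an illustration under simplifying assumptions, not what LlamaIndex or LangChain do internally: it splits only at `#` and `##`, keeps deeper headings with their parent, ignores `#` lines inside fenced code, and drops sections that contain nothing but a heading. Token-length splitting and overlap (rule 4) are left out.

```python
import re

HEADING = re.compile(r"^(#{1,2}) +(.+?)\s*$")  # matches '#' and '##' only
FENCE = "`" * 3  # built this way so the sample can sit inside a fenced block


def split_by_headings(markdown: str) -> list[dict]:
    """Split at # and ## headings; ### and deeper stay inside their parent section."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # (level, title) of the headings above the cursor
    buffer: list[str] = []
    in_fence = False

    def flush() -> None:
        body = "\n".join(buffer).strip()
        buffer.clear()
        heading_only = "\n" not in body and HEADING.match(body) is not None
        if body and not heading_only:
            heading_path = " > ".join(title for _, title in path)
            chunks.append({"heading_path": heading_path, "text": body})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        buffer.append(line)
    flush()
    return chunks
```

Because a fenced block is never split here, a code sample always travels with the section that explains it. Tables stay whole for the same reason: this sketch only ever cuts at a heading.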

The converter name matters. If you later discover that one source produced bad Markdown, you can reprocess only those documents.

I also keep the raw Markdown. Storage is cheap. Reconstructing lost structure is not. If embeddings change, chunking strategy changes, or your retrieval evaluation gets stricter, raw Markdown lets you rebuild the index without crawling everything again.
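Keeping the raw Markdown and the converter name only helps if you can query them later. A minimal sketch, assuming a single JSON manifest file on disk is enough for your corpus size; the file layout, `doc_id` scheme, and converter labels are hypothetical.

```python
import json
from pathlib import Path


def record_document(manifest_path: Path, raw_dir: Path, doc_id: str,
                    markdown: str, source_url: str, converter: str) -> None:
    """Save the raw Markdown and record where it came from in a JSON manifest."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    raw_path = raw_dir / f"{doc_id}.md"
    raw_path.write_text(markdown, encoding="utf-8")
    manifest = {}
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    manifest[doc_id] = {
        "source_url": source_url,
        "converter": converter,
        "raw_path": str(raw_path),
    }
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")


def docs_from_converter(manifest_path: Path, converter: str) -> list[str]:
    """List the document IDs produced by one converter, e.g. to reprocess only those."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    return sorted(doc_id for doc_id, entry in manifest.items()
                  if entry["converter"] == converter)
```

When one converter turns out to mangle tables, `docs_from_converter(manifest_path, "that-converter")` gives you the exact list to re-convert and re-embed.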

## Where Web2MD is not the right tool

Web2MD has limits, and they matter.

It is Chrome-only. If your pipeline must run headlessly on a server, use Firecrawl, Playwright plus a Markdown converter, Jina Reader, or another backend workflow.

It is not a full-site crawler. If your goal is "ingest every page under `/docs` every night," Firecrawl or a custom crawler is a better fit.

The free tier is limited to 3 conversions per day. Pro is $9/month. That is fine for many researchers, indie builders, and AI power users, but it is still a paid browser tool if you use it heavily.

It also does not replace MarkItDown or Unstructured for non-web files. PDFs, DOCX files, slides, and scanned documents need different handling.

So I would not frame Web2MD as "better than Firecrawl." I would frame it as the missing browser-capture layer in a clean RAG ingestion workflow.

## My default recommendation

If you are ingesting a whole public website, start with Firecrawl, then inspect the Markdown before indexing.

If you are prototyping with public URLs, try Jina Reader because it is fast.

If you are ingesting mixed files, add MarkItDown or Unstructured.

If you are collecting hand-picked web pages, authenticated content, dynamic pages, or sources you want to review before embedding, use Web2MD.

The best RAG pipeline is not the one with the fanciest vector database. It is the one that keeps source structure clean enough that retrieval has a fair chance.

Use Markdown as the durable intermediate format, chunk by headings, keep raw Markdown, and treat browser-captured pages as first-class sources.

Install Web2MD at https://web2md.org.
