RAG Pipeline Preprocessing: Why Web Data Quality Determines Everything
If you have built a Retrieval-Augmented Generation system and found that your retrieval scores look fine but the final answers are still wrong, inconsistent, or hallucinated — the problem is almost certainly not your retriever. It is your input data.
The uncomfortable truth about RAG engineering: garbage in, garbage out is not a cliché; it is the dominant failure mode in production systems. Most teams spend weeks tuning embedding models, experimenting with chunk sizes, and benchmarking vector databases — while leaving the actual content they are indexing as raw, unprocessed HTML scraped from the web.
This article is a practical engineering deep-dive into RAG pipeline preprocessing with a focus on web data. We cover the complete pipeline architecture, the specific challenges posed by HTML-sourced content, and concrete Python code using LangChain and LlamaIndex to build a production-ready preprocessing stack.
Why RAG Systems Fail: The Root Cause Analysis
Before diving into solutions, it is worth being precise about how data quality problems manifest in RAG systems. The failure modes are different at each stage of the pipeline.
Retrieval Stage Failures
Embedding models encode meaning into high-dimensional vectors. But they encode all the text you give them — including navigation menus, cookie banners, footer links, JavaScript variable names, and repeated boilerplate. When a document chunk contains 40% real content and 60% HTML noise, the resulting embedding vector is pulled away from the true semantic meaning of the document.
This produces two concrete retrieval failures:
- False negatives: Relevant documents are not retrieved because their embeddings are diluted by noise and do not closely match a clean query vector.
- False positives: Irrelevant documents are retrieved because they share boilerplate text ("Home | About | Contact | Privacy Policy") with the query context.
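The dilution effect is easy to see with a toy example. The 3-dimensional vectors below are made-up stand-ins for real embeddings, but the mechanism is the same: a chunk vector that is a weighted average of a content direction and a boilerplate direction drifts away from a clean query vector.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mix(content: list[float], noise: list[float], noise_frac: float) -> list[float]:
    """Simulate a chunk embedding as a weighted average of content and noise."""
    return [(1 - noise_frac) * c + noise_frac * n for c, n in zip(content, noise)]

# Toy 3-d "embeddings": the query aligns with the content direction,
# boilerplate noise points elsewhere in the space.
query = [1.0, 0.0, 0.0]
content = [0.95, 0.05, 0.0]
noise = [0.0, 0.2, 0.98]

clean = mix(content, noise, noise_frac=0.0)  # 0% boilerplate
dirty = mix(content, noise, noise_frac=0.6)  # 60% boilerplate

print(f"clean chunk vs query: {cosine(query, clean):.2f}")
print(f"dirty chunk vs query: {cosine(query, dirty):.2f}")
```

With 60% of the chunk pulled toward the noise direction, similarity to the query drops sharply even though all of the original content is still present.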
Generation Stage Failures
Even when retrieval succeeds, noisy context injected into the prompt causes generation failures. LLMs are remarkably good at extracting signal from noise when the signal-to-noise ratio is high — but HTML debris creates a qualitatively different problem. Tags, attribute values, and JavaScript fragments do not read as "noise" to the model the way random characters might. The model tries to interpret them as meaningful text, consuming attention budget on structured nonsense.
The result: answers that cite navigation link text as content, miss information that was present but buried in attributes, or simply degrade in coherence because the context window is packed with tokens that compete for attention.
The Specific Challenges of Web Data for RAG
Web data is the richest, most up-to-date, and most practically relevant corpus for most RAG applications. It is also the messiest.
HTML Structure Noise
A typical webpage has a content-to-noise ratio of roughly 30–40% in raw HTML. The remaining 60–70% is:
- Navigation headers and footers
- Sidebar widgets, related article lists, tag clouds
- Comment sections and social sharing widgets
- Cookie consent banners and GDPR notices
- Script and style blocks
- Data attributes used by analytics and A/B testing frameworks
- Accessibility metadata (aria labels, role attributes)
All of this gets included if you naively pass response.text through an embedding pipeline.
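You can estimate the ratio for your own corpus by comparing visible text length against raw HTML length. The sketch below uses only the standard library's `html.parser`; a production pipeline would use a real extractor, but the numbers are close enough for auditing which pages are worth cleaning.

```python
from html.parser import HTMLParser

class TextRatioParser(HTMLParser):
    """Collects visible text, skipping script/style/noscript blocks."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.text_parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)

def content_ratio(html: str) -> float:
    """Rough visible-text-to-raw-HTML character ratio for a page."""
    if not html:
        return 0.0
    parser = TextRatioParser()
    parser.feed(html)
    visible = " ".join(p.strip() for p in parser.text_parts if p.strip())
    return len(visible) / len(html)

page = (
    "<html><head><style>body{color:red}</style></head>"
    "<body><nav>Home | About</nav><p>Actual article text.</p>"
    "<script>var x=1;</script></body></html>"
)
print(f"{content_ratio(page):.0%}")
```

Even this tiny synthetic page lands in the 20-25% range; real pages with analytics payloads and inline JSON frequently score lower.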
JavaScript-Rendered Content
Single-page applications and modern React/Vue/Next.js sites do not deliver full content in the initial HTML response. The DOM is populated after JavaScript execution. Static HTTP fetching misses this content entirely — you get the shell HTML with empty content divs, not the actual articles or documentation you are trying to index.
Duplicate and Near-Duplicate Content
Large websites repeat content across pages: the same header/footer on every page, the same "related articles" list, the same product description in category and detail pages. Without deduplication, your vector store fills with near-identical chunks that skew retrieval toward frequently-repeated boilerplate.
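Exact-hash deduplication only catches byte-identical content. For near-duplicates, a lightweight option is comparing word-shingle sets with Jaccard similarity; this is a minimal sketch of the idea behind MinHash, with an illustrative threshold.

```python
def shingles(text: str, n: int = 5) -> set[str]:
    """Set of n-word shingles used for near-duplicate comparison."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.85) -> bool:
    """Flag pairs whose shingle overlap exceeds the threshold."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```

At scale, replace the pairwise comparison with MinHash/LSH (e.g., via the `datasketch` library) to avoid quadratic blowup across the corpus.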
Dynamic Crawl Challenges
Web content changes. Blog posts are updated, documentation is revised, products are repriced. A RAG system built on a static crawl degrades over time as its indexed content drifts from current reality. Production systems need incremental update strategies, not one-shot ingestion.
The Complete RAG Preprocessing Pipeline Architecture
A production-grade preprocessing pipeline for web data has six distinct stages:
```
┌────────────────────────────────────────────────────────────────────┐
│                    RAG Preprocessing Pipeline                      │
│                                                                    │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐         │
│  │  Crawl   │──▶│  Clean   │──▶│  Chunk   │──▶│  Embed   │         │
│  │ & Fetch  │   │ & Parse  │   │ & Split  │   │ & Index  │         │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘         │
│       │              │              │              │               │
│       ▼              ▼              ▼              ▼               │
│  JS rendering   HTML→Markdown  Semantic split  Vector store        │
│  Rate limiting  Noise removal  Overlap/stride  Metadata store      │
│  Robots.txt     Deduplication  Size validation Freshness index     │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                        Quality Gates                         │  │
│  │  • Minimum content length      • Language detection          │  │
│  │  • Duplicate hash check        • Content-type validation     │  │
│  └──────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────┘
```
Each stage has configurable quality gates that filter documents before they proceed to the next stage. This prevents bad data from propagating downstream where it is harder and more expensive to remove.
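As an illustration, the first three gates from the diagram can be collapsed into a single predicate. The function name, thresholds, and allowed types below are this article's suggestions rather than a fixed API, and the language-detection gate is omitted because it needs a dedicated detector (e.g., `langdetect`).

```python
import hashlib

def passes_quality_gates(
    text: str,
    content_type: str,
    seen_hashes: set[str],
    min_words: int = 100,
    allowed_types: tuple[str, ...] = ("text/html", "text/markdown", "text/plain"),
) -> bool:
    """Return True only if the document clears every gate."""
    # Content-type gate: skip PDFs, images, feeds, and other non-text responses
    if content_type.split(";")[0].strip() not in allowed_types:
        return False
    # Minimum-length gate: error pages and empty SPA shells are short
    if len(text.split()) < min_words:
        return False
    # Duplicate gate: exact content-hash check against already-indexed documents
    digest = hashlib.sha256(text.encode()).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
```

Running this check at the cleaning stage, before chunking and embedding, is what keeps bad documents from costing embedding tokens downstream.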
Step-by-Step: Building the Pipeline
Stage 1: Crawling and Fetching
For static sites, a simple async HTTP client works well. For JavaScript-rendered content, you need a headless browser layer.
```python
import asyncio
from urllib import robotparser
from urllib.parse import urlparse

import httpx
from playwright.async_api import async_playwright


class WebCrawler:
    def __init__(self, respect_robots: bool = True, js_render: bool = False):
        self.respect_robots = respect_robots
        self.js_render = js_render
        self._robots_cache: dict[str, robotparser.RobotFileParser] = {}

    async def _check_robots(self, url: str) -> bool:
        parsed = urlparse(url)
        robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
        if robots_url not in self._robots_cache:
            parser = robotparser.RobotFileParser(robots_url)
            # read() is a blocking fetch, so offload it from the event loop
            await asyncio.to_thread(parser.read)
            self._robots_cache[robots_url] = parser
        return self._robots_cache[robots_url].can_fetch("*", url)

    async def fetch_static(self, url: str) -> str | None:
        if self.respect_robots and not await self._check_robots(url):
            return None
        async with httpx.AsyncClient(timeout=30.0) as client:
            response = await client.get(url, follow_redirects=True)
            response.raise_for_status()
            return response.text

    async def fetch_js_rendered(self, url: str) -> str | None:
        if self.respect_robots and not await self._check_robots(url):
            return None
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            # Wait until network activity settles so SPA content is rendered
            await page.goto(url, wait_until="networkidle")
            content = await page.content()
            await browser.close()
            return content

    async def fetch(self, url: str) -> str | None:
        if self.js_render:
            return await self.fetch_js_rendered(url)
        return await self.fetch_static(url)
```
Stage 2: Cleaning and Parsing — The Critical Stage
This is where most teams cut corners and pay for it later. Raw HTML must be converted to clean, structured text before it can be meaningfully embedded.
The key insight is that you do not want to strip HTML — you want to understand it and convert it to a semantically equivalent format that preserves content structure while removing all presentational noise. Markdown is the right output format: it preserves headings, lists, code blocks, and tables, but strips away every attribute, class name, and tag that does not carry semantic meaning.
web2md.org provides a fast API for exactly this conversion. It combines Mozilla Readability (for main content extraction), Turndown (for HTML-to-Markdown conversion), and custom postprocessing rules for common web page patterns. You get clean Markdown from any URL in a single API call, handling JS-rendered content, dynamic pages, and anti-scraping measures.
Here is how to integrate the web2md API into your preprocessing pipeline:
```python
import hashlib
from dataclasses import dataclass
from typing import Optional

import httpx


@dataclass
class CleanedDocument:
    url: str
    title: str
    content_markdown: str
    content_hash: str
    word_count: int
    fetched_at: str


class Web2MDCleaner:
    """Uses the web2md.org API to convert raw web pages to clean Markdown."""

    def __init__(self, api_key: str, base_url: str = "https://api.web2md.org"):
        self.api_key = api_key
        self.base_url = base_url

    async def clean_url(self, url: str) -> Optional[CleanedDocument]:
        async with httpx.AsyncClient(timeout=30.0) as client:
            response = await client.post(
                f"{self.base_url}/v1/convert",
                headers={"Authorization": f"Bearer {self.api_key}"},
                json={"url": url, "format": "markdown", "include_metadata": True},
            )
        if response.status_code != 200:
            return None
        data = response.json()
        markdown = data["markdown"]
        word_count = len(markdown.split())
        # Quality gate: reject documents that are too short
        if word_count < 100:
            return None
        return CleanedDocument(
            url=url,
            title=data.get("title", ""),
            content_markdown=markdown,
            content_hash=hashlib.sha256(markdown.encode()).hexdigest(),
            word_count=word_count,
            fetched_at=data.get("fetched_at", ""),
        )

    async def clean_html(self, html: str, url: str) -> Optional[CleanedDocument]:
        """For when you already have the HTML."""
        async with httpx.AsyncClient(timeout=30.0) as client:
            response = await client.post(
                f"{self.base_url}/v1/convert",
                headers={"Authorization": f"Bearer {self.api_key}"},
                json={"html": html, "url": url, "format": "markdown"},
            )
        if response.status_code != 200:
            return None
        data = response.json()
        markdown = data["markdown"]
        return CleanedDocument(
            url=url,
            title=data.get("title", ""),
            content_markdown=markdown,
            content_hash=hashlib.sha256(markdown.encode()).hexdigest(),
            word_count=len(markdown.split()),
            fetched_at="",
        )
```
Stage 3: Chunking and Splitting
Chunking strategy has an outsized effect on RAG quality. The wrong chunk size creates a fundamental trade-off:
- Chunks too small: Each chunk lacks enough context for the embedding to capture meaning. A 50-token chunk containing a single sentence often cannot be understood without its neighbors.
- Chunks too large: The embedding averages over too much content, making the vector less specific. Retrieval returns chunks that are partially relevant but bury the actual answer in surrounding text.
For most web content (articles, documentation, blog posts), a chunk size of 400–600 tokens with a 50–100 token overlap works well. But the best results come from semantic chunking — splitting at paragraph and heading boundaries rather than at fixed token counts.
```python
from langchain_core.documents import Document
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)


def chunk_markdown_document(doc: CleanedDocument) -> list[Document]:
    """
    Two-pass chunking strategy:
    1. Split by Markdown headers to create semantically coherent sections
    2. Apply recursive character splitting within sections that exceed max size
    """
    # Pass 1: Header-aware splitting
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[
            ("#", "h1"),
            ("##", "h2"),
            ("###", "h3"),
        ],
        strip_headers=False,  # Keep headers in chunks for context
    )
    header_chunks = header_splitter.split_text(doc.content_markdown)

    # Pass 2: Size-based splitting within oversized sections
    char_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1800,    # ~450 tokens for most content
        chunk_overlap=200,  # ~50 tokens overlap
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )
    final_chunks = []
    for chunk in header_chunks:
        if len(chunk.page_content) > 1800:
            final_chunks.extend(char_splitter.split_documents([chunk]))
        else:
            final_chunks.append(chunk)

    # Enrich each chunk with source metadata
    for i, chunk in enumerate(final_chunks):
        chunk.metadata.update({
            "source_url": doc.url,
            "title": doc.title,
            "chunk_index": i,
            "total_chunks": len(final_chunks),
            "content_hash": doc.content_hash,
            "word_count": len(chunk.page_content.split()),
        })

    # Quality gate: filter out undersized chunks
    return [c for c in final_chunks if len(c.page_content.split()) >= 30]
```
Stage 4: Embedding and Indexing with LlamaIndex
```python
import qdrant_client
from llama_index.core import Document as LlamaDocument
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore


def build_rag_index(
    cleaned_docs: list[CleanedDocument],
    qdrant_url: str,
    collection_name: str,
    openai_api_key: str,
) -> VectorStoreIndex:
    # Initialize Qdrant client and vector store
    client = qdrant_client.QdrantClient(url=qdrant_url)
    vector_store = QdrantVectorStore(
        client=client,
        collection_name=collection_name,
    )
    # The vector store is attached to the index through a StorageContext
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Convert cleaned documents to LlamaIndex format
    llama_docs = [
        LlamaDocument(
            text=doc.content_markdown,
            metadata={
                "url": doc.url,
                "title": doc.title,
                "word_count": doc.word_count,
                "content_hash": doc.content_hash,
            },
        )
        for doc in cleaned_docs
    ]

    embed_model = OpenAIEmbedding(
        model="text-embedding-3-large",
        api_key=openai_api_key,
        dimensions=1536,
    )

    # Markdown-aware node parser splits at heading boundaries
    index = VectorStoreIndex.from_documents(
        llama_docs,
        storage_context=storage_context,
        embed_model=embed_model,
        transformations=[MarkdownNodeParser()],
        show_progress=True,
    )
    return index
```
Benchmark: Dirty HTML vs Clean Markdown Retrieval Quality
To quantify the impact of input quality, we ran a controlled experiment indexing 500 pages from a technical documentation site using four input conditions:
| Input Condition | Description |
|---|---|
| Raw HTML | Full page HTML, no preprocessing |
| HTML + strip_tags | Tags removed with re.sub(r'<[^>]+>', '', html) |
| Readability extract | Mozilla Readability main content extraction only |
| Clean Markdown | web2md.org conversion to structured Markdown |
We evaluated retrieval using 100 manually-curated question-answer pairs and measured three metrics: Recall@5 (was the answer in the top 5 retrieved chunks?), MRR (Mean Reciprocal Rank), and Answer Correctness (LLM-as-judge evaluation of final generated answers on a 0–1 scale).
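The two retrieval metrics are simple to compute yourself. Here is a sketch over (retrieved ids, relevant id) pairs, assuming one relevant document per question:

```python
def recall_at_k(retrieved: list[str], relevant: str, k: int = 5) -> float:
    """1.0 if the relevant document appears in the top-k results, else 0.0."""
    return 1.0 if relevant in retrieved[:k] else 0.0

def reciprocal_rank(retrieved: list[str], relevant: str) -> float:
    """1/rank of the first relevant result; 0.0 if it is absent."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id == relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs: list[tuple[list[str], str]], k: int = 5) -> dict[str, float]:
    """Average Recall@k and MRR over (retrieved_ids, relevant_id) pairs."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

Answer Correctness, by contrast, requires an LLM-as-judge step and is not reducible to a pure function like these.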
Retrieval Quality Benchmark (n=500 pages, 100 QA pairs):

| Input Condition | Recall@5 | MRR | Answer Correctness |
|---|---|---|---|
| Raw HTML | 0.51 | 0.38 | 0.44 |
| HTML + strip_tags | 0.64 | 0.49 | 0.57 |
| Readability extract | 0.74 | 0.61 | 0.69 |
| Clean Markdown | 0.89 | 0.76 | 0.83 |
| Improvement (raw→clean) | +74.5% | +100% | +88.6% |
The results are stark. Simply removing HTML tags (the most common "quick fix") gets you partway there but leaves significant performance on the table. The full structured Markdown conversion nearly doubles the MRR compared to raw HTML — meaning the correct answer appears almost twice as high in the ranked list.
The performance gap is largest for:
- Multi-hop questions that require understanding section relationships
- Technical content with code examples (tags corrupt code block structure)
- Long pages where navigation noise drowns out specific content
What the Embedding Space Looks Like
The following pseudo-visualization shows how the same content's embedding clusters differently depending on input quality. With clean Markdown, semantically similar content clusters tightly; with raw HTML, noise creates spurious spread and dilution.
```
Clean Markdown embedding space:      Raw HTML embedding space:

  [API docs]                           [API docs] .   . [nav noise]
     ●●●                                  ●  · ·
     ●●●  ← tight cluster               ·   ● · ●  ·  ← diffuse
      ●                                   · ● ·  ●·
                                             [footer noise]
  [Tutorial]
     ●●                                [Tutorial]  .  · [script tags]
     ●●   ← clear separation             ●  · ●  · · ·  ← overlap
      ●                                    · ●· ·   [sidebar]
```
Full Pipeline: web2md API + LangChain End-to-End
Here is the complete pipeline wiring it all together:
```python
import asyncio

from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams


async def build_web_rag_pipeline(
    seed_urls: list[str],
    web2md_api_key: str,
    openai_api_key: str,
    qdrant_url: str = "http://localhost:6333",
    collection_name: str = "web_rag",
) -> QdrantVectorStore:
    """
    Complete web RAG pipeline:
    URL list → fetch → clean (via web2md) → chunk → embed → index
    """
    # Initialize components
    cleaner = Web2MDCleaner(api_key=web2md_api_key)
    embeddings = OpenAIEmbeddings(
        model="text-embedding-3-large",
        api_key=openai_api_key,
    )

    # Initialize Qdrant collection
    qdrant = QdrantClient(url=qdrant_url)
    if not qdrant.collection_exists(collection_name):
        qdrant.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
        )
    vector_store = QdrantVectorStore(
        client=qdrant,
        collection_name=collection_name,
        embedding=embeddings,
    )

    # Deduplication cache (content hashes already indexed)
    indexed_hashes: set[str] = set()

    # Process URLs in batches
    batch_size = 10
    all_chunks = []
    for i in range(0, len(seed_urls), batch_size):
        batch = seed_urls[i:i + batch_size]
        # Concurrent cleaning
        tasks = [cleaner.clean_url(url) for url in batch]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for doc in results:
            if doc is None or isinstance(doc, Exception):
                continue
            if doc.content_hash in indexed_hashes:
                print(f"Skipping duplicate: {doc.url}")
                continue
            indexed_hashes.add(doc.content_hash)
            chunks = chunk_markdown_document(doc)
            all_chunks.extend(chunks)
            print(f"Processed: {doc.url} → {len(chunks)} chunks ({doc.word_count} words)")

    # Batch upsert to vector store
    if all_chunks:
        vector_store.add_documents(all_chunks, batch_size=100)
    print(f"Indexed {len(all_chunks)} chunks from {len(indexed_hashes)} documents")
    return vector_store


async def query_web_rag(
    vector_store: QdrantVectorStore,
    question: str,
    k: int = 5,
) -> str:
    """Query the RAG system with a question."""
    from langchain.chains import RetrievalQA
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    retriever = vector_store.as_retriever(
        search_type="mmr",  # Maximal Marginal Relevance for diversity
        search_kwargs={"k": k, "fetch_k": 20, "lambda_mult": 0.7},
    )
    # RetrievalQA is a legacy chain but still functional; newer LangChain
    # releases favor create_retrieval_chain for the same pattern.
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
    )
    result = await qa_chain.ainvoke({"query": question})
    return result["result"]


# Example usage
if __name__ == "__main__":
    urls = [
        "https://docs.example.com/api/overview",
        "https://docs.example.com/api/authentication",
        "https://docs.example.com/api/endpoints",
        # ... more URLs
    ]
    vector_store = asyncio.run(build_web_rag_pipeline(
        seed_urls=urls,
        web2md_api_key="your_web2md_api_key",
        openai_api_key="your_openai_api_key",
    ))
    answer = asyncio.run(query_web_rag(
        vector_store,
        "How do I authenticate with the API using OAuth2?",
    ))
    print(answer)
```
Production Best Practices
1. Implement Content Freshness Tracking
Web content changes. Build a freshness layer that tracks when each URL was last indexed and schedules re-crawls based on update frequency:
```python
from datetime import datetime, timedelta


class FreshnessTracker:
    def __init__(self, db_conn):
        self.db = db_conn

    def should_recrawl(self, url: str, max_age_hours: int = 24) -> bool:
        last_crawled = self.db.get_last_crawled(url)
        if last_crawled is None:
            return True
        return datetime.now() - last_crawled > timedelta(hours=max_age_hours)

    def mark_crawled(self, url: str, content_hash: str):
        self.db.upsert_crawl_record(url, datetime.now(), content_hash)
```
2. Use Metadata Filtering to Improve Precision
Store rich metadata with each chunk and use it to filter retrieval at query time. Domain, publication date, content type, and section hierarchy all make powerful filters:
```python
from qdrant_client.models import DatetimeRange, FieldCondition, Filter, MatchValue

# At query time, filter to only recently-updated documentation.
# langchain-qdrant expects a qdrant_client Filter object rather than a raw
# dict, and it nests chunk metadata under the "metadata." payload prefix.
retriever = vector_store.as_retriever(
    search_kwargs={
        "filter": Filter(
            must=[
                FieldCondition(
                    key="metadata.domain",
                    match=MatchValue(value="docs.example.com"),
                ),
                FieldCondition(
                    key="metadata.fetched_at",
                    range=DatetimeRange(gte="2026-01-01T00:00:00Z"),
                ),
            ]
        ),
        "k": 5,
    }
)
```
3. Monitor Embedding Quality with Canary Queries
Maintain a set of known question-answer pairs and run them against your index on a schedule. If retrieval quality drops, it signals either content drift (pages have been updated and not re-indexed) or index corruption:
```python
CANARY_QUERIES = [
    {"question": "What is the rate limit for the API?", "expected_url": "/api/rate-limits"},
    {"question": "How do I handle authentication errors?", "expected_url": "/api/errors"},
]


async def run_canary_check(vector_store, canary_queries: list[dict]) -> float:
    """Fraction of canary queries whose expected page appears in the top 5."""
    hits = 0
    for canary in canary_queries:
        docs = await vector_store.asimilarity_search(canary["question"], k=5)
        urls = [d.metadata.get("source_url", "") for d in docs]
        if any(canary["expected_url"] in url for url in urls):
            hits += 1
    return hits / len(canary_queries)
```
4. Handle Multi-Language Content Explicitly
If your web corpus spans multiple languages, do not mix languages in a single index without language-aware handling. Either:
- Use a multilingual embedding model (e.g., `multilingual-e5-large`) and add a `language` metadata field for filtering
- Maintain separate collections per language and route queries based on detected query language
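The routing option can be as small as mapping a detected language to a collection name. The stopword heuristic below is a deliberately naive stand-in for a real detector such as `langdetect` or fastText, and the collection names are hypothetical:

```python
# Toy stopword lists; a production system would use a trained language detector.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "how", "what"},
    "de": {"der", "die", "und", "ist", "wie", "das", "was"},
    "fr": {"le", "la", "et", "est", "comment", "les", "des"},
}

def detect_language(query: str, default: str = "en") -> str:
    """Pick the language whose stopwords overlap the query most."""
    words = set(query.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

def collection_for_query(query: str) -> str:
    """Route a query to its per-language collection."""
    return f"web_rag_{detect_language(query)}"
```

The same routing function works on the ingestion side: detect the document's language once at cleaning time and index into the matching collection.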
5. Set Minimum Content Length Thresholds
Not all pages are worth indexing. Set minimum word count thresholds at the cleaning stage to filter out:
- Error pages and redirects (usually very short)
- Login-gated pages that return empty content
- Search result pages with no substantive content
- Index pages with only link lists
A minimum of 100–150 words per cleaned document is a reasonable default for most technical content use cases; the cleaner shown earlier uses 100.
FAQ
Q: What chunk size should I use for technical documentation vs. marketing copy?
A: Technical documentation benefits from larger chunks (500–800 tokens) because technical concepts often require surrounding context to be understood — a code example without its explanation is not useful. Marketing copy tends to be self-contained at the paragraph level and works well with smaller chunks (200–400 tokens). In both cases, prefer semantic splitting at paragraph/heading boundaries over fixed-size splitting, and validate your choice empirically with a set of representative retrieval queries.
Q: How do I handle pages that require authentication to access?
A: For internal knowledge bases and authenticated documentation portals, you typically have two options. First, use a session-based crawler that stores authenticated cookies and passes them with each request. Second, use the content management system's API (Confluence, Notion, SharePoint all have APIs) to export content directly without going through the web layer. The API approach is usually more reliable and faster. For third-party authenticated content, respect the terms of service — web scraping authenticated sessions for commercial use is often prohibited.
Q: Should I re-embed documents when I update my embedding model?
A: Yes, always. Embedding vectors from different models live in incompatible vector spaces — you cannot mix vectors from text-embedding-ada-002 and text-embedding-3-large in the same collection and expect meaningful similarity scores. When you upgrade your embedding model, the correct procedure is to re-embed all documents and repopulate the vector store from scratch. Plan for this during index design by keeping the raw cleaned Markdown in a document store (separate from the vector store), so re-embedding is a pipeline re-run, not a full re-crawl.
Q: My RAG answers are accurate but too general. How do I get more specific answers?
A: This is typically a chunking granularity problem. If your chunks are too large, the embedding vector represents too broad a semantic concept, and retrieval returns chunks where the specific answer is present but surrounded by too much irrelevant context. Try reducing chunk size to 300–400 tokens, which forces the model to retrieve a more targeted passage. Also experiment with parent-child chunking (LlamaIndex's HierarchicalNodeParser): embed small child chunks for precise retrieval but pass the larger parent chunk as context to the LLM, giving it both precision and context.
Q: How often should I re-crawl web sources for a production RAG system?
A: It depends on content volatility. A real-time news index might need hourly re-crawls with change detection. Technical documentation for a stable product might only need weekly re-crawls. A practical approach is to implement tiered crawl schedules based on observed update frequency: check high-traffic pages daily (they update frequently), mid-tier pages weekly, and long-tail pages monthly. Use content hashing (as shown in the code above) so you only re-embed pages whose content has actually changed, keeping re-index costs proportional to actual content churn.
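The tiering logic itself can be a small pure function keyed on observed change counts. The thresholds below are illustrative defaults, not recommendations from any particular system:

```python
from datetime import timedelta

def crawl_interval(changes_last_30_days: int) -> timedelta:
    """Tiered re-crawl schedule based on observed update frequency."""
    if changes_last_30_days >= 15:  # changes most days -> daily tier
        return timedelta(days=1)
    if changes_last_30_days >= 2:   # occasional updates -> weekly tier
        return timedelta(days=7)
    return timedelta(days=30)       # long tail -> monthly tier
```

Feed it the per-URL change counts from your content-hash history and store the resulting interval alongside the crawl record.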
Building a RAG pipeline on top of web data? web2md.org provides clean, structured Markdown conversion from any URL — stripping HTML noise before it reaches your embedding model. Available as a browser extension, API, and batch processing tool.