
How to Convert Any Webpage to Markdown — A Complete Guide

Web2MD Team · 2026-03-25 · 8 min read

At some point, every developer who works with AI models runs into the same wall: you need the content of a webpage, but you need it clean. Not the 47KB of HTML with nested divs, tracking scripts, and inline styles. Just the text, the headings, the code blocks, and maybe the images — structured in a format that both humans and machines can parse without friction.

That format is Markdown. And getting web content into Markdown reliably is harder than it sounds.

This guide walks through every practical approach, from manual copy-paste to fully automated pipelines, so you can pick the method that fits your workflow.

Why Convert Web Pages to Markdown?

Three reasons keep coming up in practice:

  1. AI input preparation. Markdown carries the same content in far fewer tokens than raw HTML, so LLMs such as ChatGPT and Claude can process it more cheaply. We measured a 65% reduction in token costs just by cleaning HTML into Markdown before sending it to the API.

  2. Knowledge management. If you use Obsidian, Notion, Logseq, or any Markdown-based note system, converting web pages means your research lives in one searchable, portable format.

  3. Content migration. Moving a blog from WordPress to a static site generator like Astro or Hugo requires converting existing HTML posts into Markdown files. Doing this manually for 200 posts is not realistic.

The Manual Approach (and Why It Falls Short)

The simplest method is to select all content in the browser, copy it, and paste it into a Markdown editor. This works — sort of. Here is what actually happens:

What you expect:     Clean text with headings and lists
What you get:        Text with hidden formatting characters,
                     broken tables, missing code blocks,
                     and zero semantic structure

Specific problems with copy-paste:

  • Code blocks lose their formatting. Indentation disappears, syntax highlighting context is gone, and inline code merges with surrounding text.
  • Tables become plain text. Column alignment vanishes entirely. A 5-column comparison table turns into a jumbled paragraph.
  • Images are dropped. You get the alt text at best, nothing at worst.
  • Links lose their URLs. You see the anchor text but the actual href is gone unless you manually inspect each link.
  • Hidden characters sneak in. Zero-width spaces, non-breaking spaces, and directional formatting marks from the original HTML survive the copy and silently break downstream tools.
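
If you do end up pasting text, the hidden characters listed above can be stripped with a few lines of Python. This is a minimal sketch using only the standard library; the function names and the `INVISIBLE` set are illustrative choices, and the set is not an exhaustive list of invisible Unicode characters.

```python
import unicodedata

# Characters that commonly survive copy-paste (illustrative, not exhaustive).
INVISIBLE = {
    "\u200b",  # zero-width space
    "\u200c",  # zero-width non-joiner
    "\u200d",  # zero-width joiner
    "\u2060",  # word joiner
    "\ufeff",  # byte-order mark / zero-width no-break space
    "\u200e", "\u200f",  # left-to-right and right-to-left marks
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # directional embeddings and overrides
}


def clean_pasted_text(text: str) -> str:
    """Replace non-breaking spaces with plain spaces and drop invisible marks."""
    text = text.replace("\u00a0", " ")
    return "".join(ch for ch in text if ch not in INVISIBLE)


def find_hidden_characters(text: str) -> list:
    """Return (index, Unicode name) pairs for each suspicious character."""
    return [
        (index, unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(text)
        if ch in INVISIBLE or ch == "\u00a0"
    ]
```

Running `find_hidden_characters` first is useful when you want to see what is in the text before changing it. Note that removing the zero-width joiner also breaks up joined emoji sequences, so drop it from the set if your content relies on them.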

For a single paragraph, copy-paste is fine. For anything structured, you need a better method.

Automated Tools: A Practical Comparison

Three tools handle web-to-Markdown conversion well, each with different tradeoffs.

Web2MD

Web2MD is a browser extension that converts the current page to Markdown with one click. It runs entirely in the browser, which means it can handle JavaScript-rendered content (SPAs, dynamically loaded articles) because it works on the already-rendered DOM rather than the raw HTML source.

Strengths:

  • Handles JS-rendered pages natively (React, Vue, Next.js sites)
  • Preserves code blocks with language detection
  • Shows estimated token count for AI workflows
  • Works offline after installation — no data sent to external servers

Best for: Interactive use, AI prompt preparation, research workflows where you are reading pages one at a time.

Jina Reader

Jina Reader is an API-based service. Prepend https://r.jina.ai/ to any URL and it returns the page content as Markdown. It is excellent for programmatic access:

```bash
curl -s "https://r.jina.ai/https://example.com/article" > article.md
```

Strengths:

  • API-first design, easy to integrate into scripts
  • Handles most static and server-rendered pages
  • No browser extension needed

Limitations:

  • Requires sending your target URL to a third-party server
  • Struggles with heavily JavaScript-dependent pages
  • Rate limits apply on the free tier

Best for: Batch processing, CI/CD pipelines, server-side extraction.
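
For batch jobs, the same URL-prefix pattern can be scripted. The sketch below uses only Python's standard library and assumes nothing about the service beyond the prefix shown in the curl example; `save_batch`, the `page-NNN.md` file naming, and the `delay_seconds` pause are hypothetical choices, and the right delay depends on whatever rate limits apply to your account.

```python
import time
import urllib.request
from pathlib import Path

READER_PREFIX = "https://r.jina.ai/"  # prefix taken from the curl example


def reader_url(target_url: str) -> str:
    """Build the Reader URL by prepending the prefix to the target URL."""
    return READER_PREFIX + target_url


def fetch_markdown(target_url: str, timeout: float = 30.0) -> str:
    """Fetch one page as Markdown. This performs a network request."""
    with urllib.request.urlopen(reader_url(target_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def save_batch(urls: list, out_dir: str, delay_seconds: float = 2.0) -> None:
    """Fetch each URL in turn and write it to out_dir as page-000.md, page-001.md, ..."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for index, url in enumerate(urls):
        markdown = fetch_markdown(url)
        (out / f"page-{index:03d}.md").write_text(markdown, encoding="utf-8")
        time.sleep(delay_seconds)  # hypothetical pause to stay under rate limits
```

A production version would also want retries and error handling for pages that fail to load; they are left out here to keep the pattern visible.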

MarkDownload

MarkDownload is an open-source browser extension that saves pages as Markdown files. Under the hood it uses Turndown, the most widely used JavaScript library for HTML-to-Markdown conversion.

Strengths:

  • Open source and well-maintained
  • Customizable output via Turndown rules
  • Saves directly to .md files

Limitations:

  • Less focused on AI workflows (no token counting)
  • Output sometimes includes navigation elements and sidebar content
  • Configuration requires some familiarity with Turndown options

Best for: Saving articles for offline reading, building a personal knowledge base in Obsidian or similar tools.

Quick Comparison

| Feature | Web2MD | Jina Reader | MarkDownload |
|---|---|---|---|
| JS-rendered pages | Yes | Limited | Yes |
| API access | No | Yes | No |
| Token counting | Yes | No | No |
| Offline capable | Yes | No | Yes |
| Content filtering | Yes | Partial | Partial |
| Open source | No | No | Yes |

Best Practices for Clean Conversion

Regardless of which tool you use, these practices help ensure your Markdown output is actually useful.

Preserve Code Blocks Correctly

The most common conversion failure is code blocks losing their language annotation. Good Markdown conversion should produce fenced code blocks with the language specified:

```python
import requests

def fetch_page(url):
    # Return the raw HTML body of the page at `url`.
    response = requests.get(url)
    return response.text
```

If your tool outputs code blocks without the language tag, downstream syntax highlighters and AI models lose context. Web2MD detects the language from the original HTML's class attribute on <code> elements (e.g., class="language-python") and adds it automatically.
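
To make the class-attribute technique concrete, here is a rough sketch of how a converter could pick up the language tag. It is not Web2MD's actual implementation; it uses Python's built-in `html.parser`, and the `CodeBlockExtractor` class and `extract_code_blocks` function are hypothetical names. It also assumes the common `language-xxx` or `lang-xxx` class convention, which not every site follows.

```python
from html.parser import HTMLParser

FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown


class CodeBlockExtractor(HTMLParser):
    """Collect <pre><code> blocks as fenced Markdown, keeping the language tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._in_pre = False
        self._language = ""
        self._buffer = None  # list of text pieces while inside <code>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True
        elif tag == "code" and self._in_pre:
            classes = (dict(attrs).get("class") or "").split()
            # Take the first class of the form language-xxx or lang-xxx.
            self._language = next(
                (c.split("-", 1)[1] for c in classes
                 if c.startswith(("language-", "lang-"))),
                "",
            )
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "code" and self._buffer is not None:
            code = "".join(self._buffer).strip("\n")
            self.blocks.append(f"{FENCE}{self._language}\n{code}\n{FENCE}")
            self._buffer = None
        elif tag == "pre":
            self._in_pre = False


def extract_code_blocks(html: str) -> list:
    """Return every <pre><code> block in the HTML as a fenced Markdown string."""
    parser = CodeBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Because `handle_data` collects text regardless of nested tags, syntax-highlighting `<span>` wrappers inside the `<code>` element are flattened back to plain source text.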

Handle Images Thoughtfully

Images in web-to-Markdown conversion present a choice: do you want the image URLs (which may break when the source page changes), or just the alt text? For AI workflows, alt text is usually sufficient, since an image URL in a text prompt gives the model only the link, not the image itself. For archival purposes, you may want to download images locally and update the paths.

A pragmatic approach:

```markdown
<!-- For AI input: alt text is enough -->
![Chart showing 65% token reduction with clean Markdown input]

<!-- For archival: use local paths -->
![Token reduction chart](./images/token-reduction-chart.png)
```
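
Both variants can be produced from already-converted Markdown with a small post-processing step. The following is a sketch, not a full Markdown parser: the regular expression handles the common `![alt](url)` form (optionally with a quoted title) and will miss URLs that contain parentheses. The function names and the `./images` default are illustrative, and downloading the files themselves is left out.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Matches ![alt](url) and ![alt](url "title"); a simplification of real Markdown.
IMAGE_PATTERN = re.compile(r'!\[(?P<alt>[^\]]*)\]\((?P<url>[^)\s]+)(?:\s+"[^"]*")?\)')


def images_to_alt_text(markdown: str) -> str:
    """For AI input: keep only the alt text and drop images that have none."""
    def rewrite(match):
        return f"![{match['alt']}]" if match["alt"] else ""
    return IMAGE_PATTERN.sub(rewrite, markdown)


def images_to_local_paths(markdown: str, image_dir: str = "./images") -> str:
    """For archival: point each image at a local file named after the URL's last segment."""
    def rewrite(match):
        filename = PurePosixPath(urlparse(match["url"]).path).name
        return f"![{match['alt']}]({image_dir}/{filename})"
    return IMAGE_PATTERN.sub(rewrite, markdown)
```

For the archival case you would pair `images_to_local_paths` with a download step that saves each original URL under the same filename.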

Deal with JavaScript-Rendered Pages

Many modern sites render content client-side. A naive curl or fetch request returns a nearly empty HTML shell with a <div id="root"></div> and a pile of JavaScript. Server-side conversion tools (including Jina Reader) sometimes fail on these pages.

Browser-based tools like Web2MD sidestep this entirely because they operate on the DOM after JavaScript has executed. If you need programmatic access to JS-rendered pages, use a headless browser like Puppeteer or Playwright to render first, then convert:

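```javascript
const puppeteer = require('puppeteer');
const TurndownService = require('turndown');

// Render the page in a headless browser, then convert the rendered DOM to Markdown.
async function convertPage(url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until the network is idle so client-side content has loaded.
    await page.goto(url, { waitUntil: 'networkidle0' });

    // Prefer the <article> element; fall back to the whole body.
    const html = await page.evaluate(
      () => document.querySelector('article')?.innerHTML || document.body.innerHTML
    );

    // Use ATX headings and fenced code blocks instead of Turndown's defaults.
    const turndown = new TurndownService({ headingStyle: 'atx', codeBlockStyle: 'fenced' });
    return turndown.turndown(html);
  } finally {
    // Close the browser even if navigation or conversion throws.
    await browser.close();
  }
}
```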

This approach works but adds complexity. For most users, a browser extension is the simpler path.
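
If you are scripting this, it helps to know when a headless browser is actually needed. One rough heuristic, sketched below with Python's standard library, is to count the visible text in the raw HTML and treat a near-empty result as a sign of a client-rendered shell. The `looks_like_empty_shell` name and the 200-character threshold are assumptions for illustration, not tested values; tune the threshold on your own pages.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Count text characters outside script, style, noscript, and template elements."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.visible_chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def looks_like_empty_shell(html: str, min_visible_chars: int = 200) -> bool:
    """Return True when the raw HTML contains too little visible text to be real content."""
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars < min_visible_chars
```

A pipeline could fetch the page with a plain HTTP request first and fall back to the headless-browser path only when this check returns `True`.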

Follow the CommonMark Spec

When customizing conversion rules or writing your own converter, stick to the CommonMark specification. It is the closest thing Markdown has to a formal standard, and most tools and renderers follow it. Avoid GitHub-Flavored Markdown (GFM) extensions like strikethrough or task lists unless your target platform explicitly supports them. Tables are also a GFM extension rather than part of CommonMark, so check that your target renders them before relying on them.
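
A quick way to see whether converted output depends on GFM extensions is to scan it for their syntax. The sketch below is a simple line-based check, not a CommonMark parser: it looks for strikethrough, task list items, and table delimiter rows, and skips anything inside fenced code blocks. The pattern set and the `find_gfm_extensions` name are illustrative.

```python
import re

# Line-level patterns for three GFM extensions (approximations, not a full parser).
GFM_PATTERNS = {
    "strikethrough": re.compile(r"~~[^~\n]+~~"),
    "task list": re.compile(r"^[ \t]*[-*+][ \t]+\[[ xX]\][ \t]"),
    "table": re.compile(r"^[ \t]*\|?[ \t]*:?-+:?[ \t]*(?:\|[ \t]*:?-+:?[ \t]*)+\|?[ \t]*$"),
}


def find_gfm_extensions(markdown: str) -> list:
    """Return (line number, extension name) pairs for GFM-only syntax outside code fences."""
    findings = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(("```", "~~~")):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        for name, pattern in GFM_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

An empty result does not prove the document is strict CommonMark, but any finding tells you which lines to check against your target renderer.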

Common Use Cases

Here is where web-to-Markdown conversion provides the most value in practice:

Research and analysis. Academics and analysts regularly need to process dozens of web sources. Converting to Markdown first makes it easy to feed content to AI for summarization or comparison. See our guide on using AI for academic research workflows for a detailed walkthrough.

RAG pipelines. Retrieval-Augmented Generation systems need clean text chunks. Markdown provides natural splitting points (headings, paragraphs) that produce better chunks than arbitrary character-count splits on raw HTML. The structured hierarchy means your retrieval system can prioritize heading-level matches over body text.
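
Heading-based splitting of the kind described for RAG pipelines can be sketched in a few lines. This version assumes ATX headings (`#` through `######`), ignores setext headings, and leaves `#` lines inside fenced code blocks alone; `split_by_headings` and the chunk dictionary shape are illustrative choices rather than a standard format.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def split_by_headings(markdown: str) -> list:
    """Split Markdown into chunks, one per heading section, tracking the heading path."""
    chunks = []
    path = []   # stack of (level, title) for the current heading hierarchy
    body = []   # lines collected since the last heading
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(("```", "~~~")):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # emit the previous section under its own heading path
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its full heading path, so a retriever can index the path separately from the body text and weight heading matches more heavily.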

Content auditing. Marketing teams converting competitor pages to Markdown can quickly diff content changes over time, extract structured data, and feed it to AI for competitive analysis — all without manually reading through each page.

Documentation migration. Moving from a wiki or CMS to a static site generator requires bulk conversion. Automating the HTML-to-Markdown step saves days of manual reformatting. Tools like Turndown can be scripted with custom rules to handle site-specific markup patterns.

AI prompt engineering. When building prompts that reference external content, clean Markdown drastically outperforms raw HTML in both token efficiency and response quality. The structure helps the model understand document hierarchy without parsing noise.

Conclusion

Converting web pages to Markdown is a solved problem — but the quality of the solution matters enormously. A sloppy conversion that drops code blocks, loses links, and includes navigation chrome is worse than no conversion at all, because it gives you false confidence that the content is clean.

Pick a tool that matches your workflow: Web2MD for interactive browser use, Jina Reader for API pipelines, MarkDownload for saving articles to a knowledge base, or Turndown for custom scripted conversion. Validate your output on a few representative pages before committing to a process. And always check that code blocks, tables, and links survived the conversion intact.
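
That final check can be partly automated by comparing rough structure counts between the source HTML and the converted Markdown. The sketch below uses only the standard library; the counting rules are approximations (inline links only, fenced code blocks only, multi-column tables only), and `conversion_report` is a hypothetical helper name.

```python
import re
from html.parser import HTMLParser

INLINE_LINK = re.compile(r"(?<!!)\[[^\]]*\]\([^)\s]+[^)]*\)")
TABLE_DELIMITER = re.compile(r"^[ \t]*\|?[ \t]*:?-+:?[ \t]*(?:\|[ \t]*:?-+:?[ \t]*)+\|?[ \t]*$")


class StructureCounter(HTMLParser):
    """Count links with an href, <pre> blocks, and tables in the source HTML."""

    def __init__(self):
        super().__init__()
        self.counts = {"links": 0, "code_blocks": 0, "tables": 0}

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.counts["links"] += 1
        elif tag == "pre":
            self.counts["code_blocks"] += 1
        elif tag == "table":
            self.counts["tables"] += 1


def count_markdown_structures(markdown: str) -> dict:
    """Count inline links, fenced code blocks, and table delimiter rows in Markdown."""
    counts = {"links": 0, "code_blocks": 0, "tables": 0}
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(("```", "~~~")):
            if not in_fence:
                counts["code_blocks"] += 1
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        counts["links"] += len(INLINE_LINK.findall(line))
        if TABLE_DELIMITER.match(line):
            counts["tables"] += 1
    return counts


def conversion_report(html: str, markdown: str) -> dict:
    """Return {structure: (expected, found)} for every structure the Markdown is missing."""
    counter = StructureCounter()
    counter.feed(html)
    counter.close()
    found = count_markdown_structures(markdown)
    return {
        name: (expected, found[name])
        for name, expected in counter.counts.items()
        if found[name] < expected
    }
```

An empty report means nothing obvious was lost. Counts can legitimately differ when a converter strips navigation links, so pass in the same content region you converted (for example, the `<article>` element) rather than the whole page.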

The ten minutes you spend setting up a clean conversion pipeline will save hours of debugging bad AI outputs downstream.
