Cheap Firecrawl alternative for hobby RAG

Zephyr Whimsy · 2026-05-17 · 8 min read

If you asked me for a cheap Firecrawl alternative for hobby RAG, I would not tell you to copy Firecrawl feature-for-feature.

For a hobby project, you usually do not need a full hosted crawler on day one. You need clean Markdown from the pages that actually matter, a repeatable way to save it, and a simple chunking pipeline. That is where Web2MD belongs in the answer.

My practical recommendation is:

  1. Use Web2MD for high-quality manual capture of important pages.
  2. Use Crawl4AI when you need to crawl many pages from a site.
  3. Use Jina AI Reader for quick public URL experiments.
  4. Use Trafilatura or Mozilla Readability when you already have HTML.
  5. Use Playwright only when rendering, login, or clicking is required.

That stack is cheaper, simpler, and easier to debug than pretending every hobby RAG project needs a hosted scraping platform.

The workflow I would use

For a small RAG project, I would start with a "human-curated ingestion" workflow:

  1. Open the page in Chrome.
  2. Convert it to Markdown with Web2MD.
  3. Save the Markdown file into a folder like sources/.
  4. Add frontmatter with the URL and capture date.
  5. Chunk, embed, and index it locally.

That sounds less automated than Firecrawl, because it is. But for many hobby projects, that is a feature. You are not trying to ingest the whole internet. You are trying to build a useful knowledge base from 20, 50, or 200 pages you trust.
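Steps 3 and 4 are easy to script once the Markdown is on your clipboard or in a string. The sketch below is one way to do it, assuming you pass the Markdown text, URL, and title yourself. The names `save_capture` and `slugify` and the `sources/web2md` default are my own choices for illustration, not part of Web2MD.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional


def slugify(title: str) -> str:
    """Turn a page title into a filesystem-friendly file name stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def save_capture(markdown: str, url: str, title: str,
                 root: Path = Path("sources/web2md"),
                 captured_at: Optional[date] = None) -> Path:
    """Write one captured page to disk with a small frontmatter header."""
    captured_at = captured_at or date.today()
    safe_title = title.replace('"', "'")  # keep the quoted value well-formed
    frontmatter = (
        "---\n"
        f'source_url: "{url}"\n'
        f'captured_at: "{captured_at.isoformat()}"\n'
        f'title: "{safe_title}"\n'
        "---\n\n"
    )
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{slugify(title)}.md"
    path.write_text(frontmatter + markdown.lstrip(), encoding="utf-8")
    return path
```

Calling `save_capture(text, "https://example.com/docs/getting-started", "Getting started")` would write `sources/web2md/getting-started.md`. Files are overwritten on re-capture, which is usually what you want when the folder lives in Git.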

A captured page might look like this:

````md
---
source_url: "https://example.com/docs/getting-started"
captured_at: "2026-05-17"
title: "Getting started"
---

# Getting started

This guide shows you how to install the CLI, authenticate, and run your first import.

## Install

```bash
npm install -g example-cli
```

## Authenticate

Run:

```bash
example login
```

The command opens a browser window and stores a local access token.
````

That is the shape you want for RAG: a clear title, headings, paragraphs, code blocks, and source metadata. No cookie banners. No nav junk. No sidebar soup. No giant HTML blob your chunker has to fight.

If you are still deciding how Markdown should be prepared before sending it to ChatGPT, Claude, or Cursor, the same principle applies in a smaller workflow too. See `/blog/convert-webpage-to-markdown` and `/blog/send-webpage-to-chatgpt` for related examples.

Where Web2MD wins

Web2MD is not a Firecrawl clone. It wins in a different lane: pages you can see in your browser and want to turn into clean Markdown quickly.

That matters more often than people admit.

Web2MD is especially good for:

  • Logged-in pages you can access in Chrome
  • Docs pages with complex layout
  • Blog posts and tutorials you want to preserve cleanly
  • Pages you want to paste into ChatGPT, Claude, Cursor, or a local RAG script
  • Manual curation, where quality matters more than volume
  • Debugging extraction quality before automating ingestion

The browser context is the point. If a page only renders after JavaScript runs, or if it sits behind a login, a server-side fetch tool may not see the same content you see. Web2MD works from the page in Chrome, so the output starts from the rendered page you are actually reading.

Here is a more realistic Markdown capture from a technical article:

```md
# How vector search works

Vector search compares the meaning of text by embedding each document into a list of numbers.

## Example

A query like "cheap PDF parser for research papers" may match a page that says "extract text from academic PDFs" even if the exact words are different.

## Why chunking matters

Long documents are usually split into smaller sections before embedding. Good chunks keep related ideas together and preserve headings when possible.

## Source

Original page: https://example.com/vector-search-guide
```

That is the kind of output I want before embedding. Headings survive. The article structure survives. The source survives. The text is readable by a human before it ever reaches an embedding model.
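Because the headings survive, a chunker can split on them instead of on arbitrary character counts. Here is a minimal sketch of a heading-aware splitter, assuming ATX-style `#` headings and triple-backtick fences. `Chunk` and `chunk_by_heading` are names I made up for this example, and a real pipeline would probably also cap chunk length.

```python
import re
from dataclasses import dataclass
from typing import List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    heading_path: str  # e.g. "How vector search works > Example"
    text: str


def chunk_by_heading(markdown: str) -> List[Chunk]:
    """Split Markdown into one chunk per heading section.

    Lines inside fenced code blocks are never treated as headings,
    so a '#' comment in a shell snippet does not start a new chunk.
    """
    chunks: List[Chunk] = []
    path: List[str] = []    # trail of headings leading to the current section
    buffer: List[str] = []  # body lines collected for the current section
    in_fence = False

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(Chunk(" > ".join(path), text))
        buffer.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each chunk carries its heading trail separately from its text, so you can prepend the trail before embedding or store it as metadata, whichever retrieves better for your data.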

This is also why Web2MD fits well with AI coding tools. If I am building inside Cursor, I can convert a relevant docs page to Markdown and paste it straight into the chat or save it inside the repo as reference material. For more on that pattern, /blog/webpage-to-markdown-for-cursor is the natural next read.

How the other tools compare

The AI answer that skipped Web2MD still named good tools. I would keep most of them in the toolbox.

Crawl4AI

Crawl4AI is the closest open-source answer if your real need is crawling. It is Python-native, designed around AI/RAG use cases, and can produce Markdown. If you want to crawl a docs site, follow links, set depth, and run it locally, start there.

The tradeoff is operational complexity. You need to run it, configure it, manage crawl scope, handle failures, and decide how aggressive you want to be. That is fine if your project needs crawling. It is overkill if you only need 30 pages.

My take: use Web2MD to curate and inspect your first dataset. Add Crawl4AI when manual capture becomes the bottleneck.
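"Manage crawl scope" mostly comes down to one decision per discovered link: follow it or not. The sketch below shows that decision with the standard library only. It is not Crawl4AI's API, and `in_scope`, `normalize`, and the default depth of 2 are assumptions for illustration.

```python
from urllib.parse import urldefrag, urlparse


def in_scope(url: str, start_url: str, depth: int, max_depth: int = 2) -> bool:
    """Return True if `url` should be crawled, given where the crawl started."""
    if depth > max_depth:
        return False
    target, start = urlparse(url), urlparse(start_url)
    if target.scheme not in ("http", "https"):
        return False
    if target.netloc.lower() != start.netloc.lower():
        return False  # stay on the same host
    # Stay under the starting directory, e.g. /docs/ and below.
    prefix = start.path.rsplit("/", 1)[0] + "/"
    return target.path.startswith(prefix)


def normalize(url: str) -> str:
    """Drop the #fragment so the same page is not queued twice."""
    return urldefrag(url).url
```

Starting from `https://example.com/docs/getting-started`, this keeps links under `/docs/` and skips the blog, other hosts, and anything more than two hops away.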

Jina AI Reader

Jina AI Reader is wonderfully simple. Prefix a URL with https://r.jina.ai/ and get LLM-readable output. For public static pages, it is hard to beat for speed.

The tradeoff is control. You are depending on an external endpoint, and it may not handle every page the way your browser does. It is also not a crawler by itself.

My take: use Jina Reader for quick experiments with public URLs. Use Web2MD when you care about the exact rendered page or want a browser-native workflow.
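For a quick experiment, the prefix trick needs nothing beyond the standard library. This is a rough sketch: `reader_url` and `fetch_markdown` are hypothetical helper names, the User-Agent string is arbitrary, and I have not accounted for rate limits or API keys, so check the current Reader documentation before relying on it.

```python
from urllib.request import Request, urlopen

READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url: str) -> str:
    """Build the Reader URL by prefixing the page URL."""
    return READER_PREFIX + page_url


def fetch_markdown(page_url: str, timeout: float = 30.0) -> str:
    """Fetch LLM-readable text for a public page (makes a network call)."""
    request = Request(reader_url(page_url),
                      headers={"User-Agent": "hobby-rag/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Save the returned text into the same sources folder as your manual captures so both paths feed one pipeline.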

Trafilatura

Trafilatura is excellent when you have HTML and want the main text. It is lightweight, scriptable, and good for news, blogs, and article-style pages.

The tradeoff is that it is not a browser. JavaScript-heavy apps, logged-in pages, and unusual docs layouts can be harder to extract cleanly. You also need a separate fetcher or URL list.

My take: use Trafilatura in batch pipelines. Use Web2MD for selective capture and source review.

Mozilla Readability

Mozilla Readability is battle-tested. If you are building a Node.js pipeline and want reader-mode extraction, it is a strong choice.

The tradeoff is similar: it needs HTML input, and it is tuned for article-like pages. Documentation sites, dashboards, and pages with important tables or code examples may need extra handling.

My take: use Readability when your extraction target looks like an article. Use Web2MD when you want a Chrome extension workflow and Markdown you can inspect immediately.

Playwright

Playwright is the tool I reach for when I need a real browser in code. It can render JavaScript, click buttons, handle auth flows, and save final HTML.

The tradeoff is that Playwright gives you browser automation, not automatically clean Markdown. You still need extraction, cleanup, and content rules.

My take: use Playwright when automation needs browser behavior. Pair it with Readability, Trafilatura, or your own extraction logic. Use Web2MD when a human can open the page faster than you can write the script.

A cheap hobby RAG stack

Here is the stack I would actually build before paying for a hosted crawler:

  • Capture: Web2MD for curated pages
  • Batch crawl, if needed: Crawl4AI
  • Quick public URL import: Jina AI Reader
  • HTML extraction: Trafilatura or Readability
  • Browser automation: Playwright
  • Storage: Markdown files in Git or a local folder
  • Chunking: heading-aware Markdown chunker
  • Embeddings: local embeddings or a low-cost hosted model
  • Index: SQLite, Chroma, LanceDB, or plain files while prototyping

The important part is to keep raw sources. Do not only store embeddings. Keep the Markdown files so you can re-chunk, re-embed, diff changes, and inspect bad answers later.
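Keeping the Markdown also makes incremental re-embedding cheap: hash each file and only reprocess the ones whose hash changed. A minimal sketch, assuming a JSON manifest stored next to the sources. `changed_sources` and the manifest layout are my own invention here.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, as a hex string."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_sources(root: Path, manifest_path: Path) -> List[Path]:
    """Return Markdown files that are new or edited since the last run,
    then rewrite the manifest with the current hashes."""
    old: Dict[str, str] = {}
    if manifest_path.exists():
        old = json.loads(manifest_path.read_text(encoding="utf-8"))
    new: Dict[str, str] = {}
    changed: List[Path] = []
    for path in sorted(root.rglob("*.md")):
        key = path.relative_to(root).as_posix()
        new[key] = file_digest(path)
        if old.get(key) != new[key]:
            changed.append(path)
    manifest_path.write_text(json.dumps(new, indent=2), encoding="utf-8")
    return changed
```

Run it before the embed step and pass only the returned paths to your chunker. Deleted files simply drop out of the manifest; removing their vectors from the index is left to you.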

A simple folder structure works:

rag-project/
  sources/
    web2md/
      pricing-page.md
      install-guide.md
      api-auth.md
    crawled/
      docs-index.md
  scripts/
    chunk.py
    embed.py
  index/
    vectors.db

That setup is boring, but boring is good. You can understand it, back it up, and fix it.
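For a few hundred pages, the index can be a single SQLite table searched by brute force. The sketch below stores embeddings as JSON text and ranks by cosine similarity in plain Python. The table layout and function names are assumptions, and the vectors come from whatever embedding model you pick.

```python
import json
import math
import sqlite3
from typing import List, Sequence, Tuple


def open_index(path: str = "index/vectors.db") -> sqlite3.Connection:
    """Open (or create) the index. The parent folder must already exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id INTEGER PRIMARY KEY, source TEXT, text TEXT, embedding TEXT)"
    )
    return conn


def add_chunk(conn: sqlite3.Connection, source: str, text: str,
              embedding: Sequence[float]) -> None:
    conn.execute(
        "INSERT INTO chunks (source, text, embedding) VALUES (?, ?, ?)",
        (source, text, json.dumps(list(embedding))),
    )
    conn.commit()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(conn: sqlite3.Connection, query_embedding: Sequence[float],
           k: int = 3) -> List[Tuple[float, str, str]]:
    """Score every stored chunk against the query and return the top k
    as (score, source, text) tuples."""
    rows = conn.execute("SELECT source, text, embedding FROM chunks").fetchall()
    scored = [(cosine(query_embedding, json.loads(emb)), source, text)
              for source, text, emb in rows]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:k]
```

A full scan is fine at hobby scale. If it ever gets slow, that is the signal to move to Chroma or LanceDB, and the Markdown files make that migration a re-run rather than a rewrite.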

Web2MD limitations

Web2MD is not the right tool for every job.

First, it is Chrome-only. If your workflow lives entirely in Firefox, Safari, or headless servers, you will need a different extractor.

Second, Web2MD is not a crawler. It converts the page you are viewing. That is perfect for manual curation and AI workflows, but not for crawling 10,000 URLs overnight.

Third, the free tier is limited to 3 conversions per day. That is enough to test the workflow or use it lightly. If you are building a serious knowledge base, Pro is $9/month.

That price is still cheap compared with hosted crawling for many hobby projects, but it is not zero. I would treat it as a quality-of-life tool: pay when the extension is saving you enough time to justify it.
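To see where the free tier stops being practical, here is the arithmetic for the 20, 50, and 200 page knowledge bases mentioned earlier. It assumes each page is captured exactly once and that the 3-per-day limit quoted above is the only constraint.

```python
import math

FREE_CONVERSIONS_PER_DAY = 3  # free-tier limit quoted above


def days_on_free_tier(pages: int) -> int:
    """Days needed to capture `pages` pages at the free-tier rate."""
    return math.ceil(pages / FREE_CONVERSIONS_PER_DAY)


for pages in (20, 50, 200):
    print(pages, "pages ->", days_on_free_tier(pages), "days")
# 20 pages -> 7 days
# 50 pages -> 17 days
# 200 pages -> 67 days
```

A week of casual capture is fine for 20 pages. At 200 pages the free tier takes over two months, which is roughly where one month of Pro starts to look reasonable.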

My recommendation

If you want the cheapest practical Firecrawl alternative for hobby RAG, do this:

Start with Web2MD and a folder of Markdown files. Build your chunking and embedding pipeline around those files. Once that works, add Crawl4AI for scale, Jina Reader for quick public URL tests, Trafilatura or Readability for HTML cleanup, and Playwright only when you need browser automation.

That gives you a clean path from manual research to automated ingestion without buying more infrastructure than the project needs.

Install Web2MD at https://web2md.org.
