2026-06-19 · Rewritten on site · 5 min read · Updated: 2026-06-19

PageToMD – A CLI tool to turn web pages into clean Markdown for AI agents

PageToMD is a CLI tool that converts any webpage into clean, LLM-ready Markdown with YAML frontmatter. It supports static and JS-rendered pages, offers robust features like retries, robots.txt respect, and atomic writes, and is designed for AI agent workflows such as RAG and LLM prompting.

Source: Hacker News AI · Author: gs202

Convert any webpage URL into clean, LLM-ready Markdown with frontmatter.

Why

AI-ready by default. Output is NFC-normalised UTF-8, single H1, monotonic heading hierarchy, no zero-width junk, no tracking chrome — drops straight into a vector store or LLM prompt.

Full-fidelity metadata. Every file ships with a YAML frontmatter block containing canonical URL, final URL after redirects, title, author, date, description, site name, language, and tool identity. No more "where did this Markdown come from?".

Static fast, JS-capable when needed. Default httpx fetcher is sub-second; opt-in playwright extra (or --fetcher auto) handles SPA shells without bloating the install for everyone else.

Stable, scriptable CLI. Typer-built, full env-var precedence (PAGETOMD_*), stable exit codes (0/2/3/4/5/64/130), structured logs (--log-json), and a --no-fetched-at switch for byte-deterministic output.

Not pandoc or curl + sed. pandoc doesn't fetch, doesn't strip boilerplate, and doesn't emit frontmatter. Hand-rolled curl | html2md pipelines re-invent extraction, mojibake handling, robots.txt, redirect caps, and atomic writes. pagetomd is one command for the whole pipeline.
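The AI-ready contract in the first point above (NFC text, no zero-width characters, a heading hierarchy that never skips a level) can be illustrated with a short sketch. This is not pagetomd's postprocessor: the function name `normalise`, the exact set of characters stripped, and the rule of clamping each heading to at most one level deeper than the previous one are all assumptions made for illustration. It does not enforce the single-H1 rule.

```python
import re
import unicodedata

# Assumption: only these three code points count as "zero-width junk" here.
# ZWJ/ZWNJ are left alone because some scripts and emoji rely on them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u2060\ufeff"))
HEADING = re.compile(r"^(#{1,6})(\s+.*)$")


def normalise(markdown: str) -> str:
    """NFC-normalise, strip zero-width characters, and close heading gaps."""
    text = unicodedata.normalize("NFC", markdown).translate(ZERO_WIDTH)
    out, previous, in_fence = [], 0, False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # leave '#' comments inside code alone
        match = None if in_fence else HEADING.match(line)
        if match:
            # Never jump more than one level below the previous heading.
            level = min(len(match.group(1)), previous + 1)
            previous = level
            line = "#" * level + match.group(2)
        out.append(line)
    return "\n".join(out)
```

With this rule an H1 followed directly by an H3 comes out as H1 then H2, which is one reasonable reading of "monotonic".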

Install

With pipx (recommended for CLI use)

pipx install pagetomd

optional: enable JS rendering for SPAs

pipx inject pagetomd playwright && playwright install chromium

With uv

uv tool install pagetomd

optional: enable JS rendering for SPAs

uv tool install 'pagetomd[playwright]' && playwright install chromium

Without installing (uv run)

Core — no install required

uv run --with pagetomd pagetomd https://example.com

With Playwright for SPA / JS-heavy pages (install Chromium once first)

uv run --with playwright playwright install chromium
uv run --with 'pagetomd[playwright]' pagetomd https://example.com --fetcher auto

With pip

pip install pagetomd # core
pip install 'pagetomd[playwright]' # + SPA support

Quick start

Default: derives output filename from the page title

pagetomd https://example.com/blog/post

Stream to stdout (pipe into LLMs, etc.)

pagetomd https://example.com/blog/post -o -

Deterministic output (omits fetched_at — good for snapshot tests / RAG ingestion)

pagetomd https://example.com/blog/post --no-fetched-at -o post.md

Auto-detect SPA pages and fall back to headless Chromium

pagetomd https://my-spa.example.com -o - --fetcher auto

Cookbook

Pipe into an LLM

-o - writes the Markdown to stdout. All logs go to stderr, so the stream is safe to pipe:

pagetomd https://example.com/blog/post -o - | llm "summarise this article in five bullet points"
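The same stdout contract makes it easy to call the CLI from a script. The sketch below assumes `pagetomd` is on `PATH`; the helper names `build_command` and `fetch_markdown` are hypothetical, and only flags listed in this README are used.

```python
import subprocess


def build_command(url: str, fetcher: str = "httpx", deterministic: bool = True) -> list[str]:
    """Build an argv list that streams Markdown to stdout (-o -)."""
    cmd = ["pagetomd", url, "-o", "-", "--fetcher", fetcher]
    if deterministic:
        cmd.append("--no-fetched-at")  # omit fetched_at for stable output
    return cmd


def fetch_markdown(url: str, **kwargs) -> str:
    """Run pagetomd and return its stdout; logs stay on stderr."""
    result = subprocess.run(
        build_command(url, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Keeping the argv construction separate from the subprocess call means the command can be inspected or logged without touching the network.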

Batch-convert from a file

while read -r url; do
  pagetomd "$url"
done < …

httpx: … redirects. No JavaScript execution: if the server sends an empty shell, that's all you get.

playwright: Renders the page in headless Chromium, waits for network idle, then serialises the live DOM (including shadow roots). Use this when you know the page is a SPA. Requires the optional playwright extra (pip install 'pagetomd[playwright]') and a one-time playwright install chromium. Slower and heavier than httpx, but the only way to get content that lives behind a JS framework.

auto: Fetches with httpx first, then inspects the result: if the text is under 200 characters and the HTML contains SPA markers (data-reactroot, a "you need to enable javascript" noscript tag, etc.), it re-fetches with Playwright. A second safety net fires if httpx returned HTML that looked non-empty but the extractor still could not pull any content: Playwright gets a shot then too. If Playwright is unavailable, the page is counted as "empty" in the crawl summary rather than treated as a hard failure. Best choice when you are pointed at an unfamiliar URL.
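The auto fallback described above can be sketched as a small decision function. This is an illustration, not pagetomd's code: the names `looks_like_spa_shell` and `choose_fetcher` are hypothetical, only the two markers that survive in this text are checked, and measuring the 200-character threshold on whitespace-stripped text is an assumption.

```python
# Only the markers named in the description above; the real list is longer.
SPA_MARKERS = (
    "data-reactroot",
    "you need to enable javascript",
)
MIN_TEXT_CHARS = 200  # threshold quoted in the description above


def looks_like_spa_shell(html: str, extracted_text: str) -> bool:
    """True when the page has little text and carries a known SPA marker."""
    if len(extracted_text.strip()) >= MIN_TEXT_CHARS:
        return False
    lowered = html.lower()
    return any(marker in lowered for marker in SPA_MARKERS)


def choose_fetcher(html: str, extracted_text: str, playwright_available: bool) -> str:
    """Return which result to keep: 'httpx', 'playwright', or 'empty'."""
    needs_js = looks_like_spa_shell(html, extracted_text)
    nothing_extracted = not extracted_text.strip()  # the second safety net
    if needs_js or nothing_extracted:
        return "playwright" if playwright_available else "empty"
    return "httpx"
```

Requiring both conditions (short text and a marker) avoids re-rendering pages that are simply short.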

Single page vs. crawl

Use the default single-page mode when you have a specific URL (or a short list piped through a while read loop). Use --crawl when you want every page under a URL prefix — it discovers links automatically, deduplicates, mirrors the URL hierarchy on disk, and reuses a single fetcher context so Playwright doesn't relaunch Chromium per page. See the crawl cookbook recipe for the full flag set.
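A prefix-limited, depth-limited breadth-first crawl of the kind described above might look like the following. It is a sketch under stated assumptions: `crawl`, `under_prefix`, and the `get_links` callback are hypothetical names, "under the prefix" is taken to mean same host plus a path that starts with the seed's path, and link discovery is supplied by the caller rather than fetched.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urlsplit


def under_prefix(url: str, seed: str) -> bool:
    """True when url is on the seed's host and under the seed's path."""
    u, s = urlsplit(url), urlsplit(seed)
    return u.netloc == s.netloc and u.path.startswith(s.path or "/")


def crawl(seed: str, get_links: Callable[[str], Iterable[str]], max_depth: int = 1) -> list[str]:
    """Breadth-first walk from seed, deduplicated, limited to max_depth hops."""
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # depth 0 means seed only
        for link in get_links(url):
            link = urldefrag(link).url  # drop #fragments before deduplicating
            if link not in seen and under_prefix(link, seed):
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

The default of `max_depth=1` mirrors the documented --crawl-depth default; everything else about the real crawler's behaviour is outside this sketch.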

Output shape

Running pagetomd http://127.0.0.1:8765/blog.html --no-fetched-at -o - against the blog.html fixture prints (first ~15 lines shown):

---
url: http://127.0.0.1:8765/blog.html
final_url: http://127.0.0.1:8765/blog.html
title: Why We Rewrote Our Build System in Rust
author: Jane Doe
date: '2024-08-14'
description: A retrospective on migrating our monorepo build pipeline from Python to Rust, and what we learned along the way.
site_name: Example Engineering Blog
language: en
tool: pagetomd
tool_version: 0.4.0
---

# Why We Rewrote Our Build System in Rust

Three years ago, our monorepo build pipeline was a sprawling Python application held together with shell scripts and prayer. ...

When fetched_at is enabled (the default), an extra fetched_at: '2026-06-15T12:34:56Z' line is included in the frontmatter. Fields whose value cannot be detected (e.g. language, author) are omitted from the YAML.
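For downstream ingestion it can help to split the frontmatter from the body. The sketch below is a deliberately minimal, stdlib-only reader: `split_frontmatter` is a hypothetical helper, it only understands flat `key: value` lines like those shown above, and it is not a full YAML parser (a real pipeline would likely use a YAML library).

```python
def split_frontmatter(document: str) -> tuple[dict[str, str], str]:
    """Split a document into (metadata, markdown body).

    Handles only flat `key: value` lines between two '---' delimiters.
    """
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, document
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return {}, document  # no closing delimiter: treat as plain Markdown
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip().strip("'\"")
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body
```

Because optional fields are omitted when undetected, callers should read them with `meta.get("author")` rather than indexing.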

Common options

A compact overview — see pagetomd --help for the full list.

| Flag | Default | Description |
| --- | --- | --- |
| --output / -o | derived from title | Output path, or - for stdout. |
| --overwrite | false | Replace an existing destination file. |
| --follow-symlinks / --no-follow-symlinks | false | Allow writes to a symlinked destination. Off by default so --overwrite cannot be tricked into clobbering a file outside the intended directory via a symlink. |
| --fetcher | httpx | httpx, playwright, or auto. |
| --timeout | 30.0 | Per-request HTTP timeout (seconds). |
| --retries | 4 | Per-page retry attempts on transient failures (default 4 = up to 5 total attempts). Honours the server's Retry-After header on 429/503 responses, capped at 5 minutes; falls back to exponential backoff otherwise. |
| --user-agent | pagetomd/ | Override the outbound User-Agent. |
| --no-verify-ssl | false | Disable TLS certificate verification (for corporate proxies that re-sign HTTPS). |
| --respect-robots / --no-respect-robots | true | Honour robots.txt (relaxed for loopback/RFC 1918). |
| --max-redirects | 10 | Cap on the redirect chain length. |
| --include-comments / --no-include-comments | false | Preserve HTML comments in the extracted document. |
| --include-images / --no-include-images | true | Keep image syntax in output. |
| --include-links / --no-include-links | true | Keep link URLs in output. |
| --heading-style | atx | atx (#) or setext (===). |
| --code-fences / --no-code-fences | true | Use fenced code blocks instead of indented ones. |
| --wide-tables | kv | Wide-table strategy: kv, html, or drop. |
| --no-fetched-at | false | Omit fetched_at for byte-deterministic output. |
| --log-level | info | debug, info, warning, error. |
| --log-json | false | Emit logs as JSON lines on stderr. |
| --debug | false | Shortcut for --log-level=debug + tracebacks on error. |
| --playwright-idle-ms | 500 | Extra wait (ms) after networkidle for late-firing scripts (Playwright fetcher only). |
| --crawl | false | Crawl all linked sub-pages under the seed URL's path prefix and write one .md file per page. Requires -o to be a directory. |
| --crawl-depth | 1 | Maximum BFS depth from the seed URL when --crawl is active. 0 = seed only. |
| --retry-failed / --no-retry-failed | true | After --crawl finishes, retry pages that failed in the initial pass once. |
| --version | - | Print the installed version and exit. |
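The --retries behaviour in the table (prefer Retry-After, cap it at 5 minutes, otherwise back off exponentially) can be expressed as a small delay function. This is a sketch, not the tool's implementation: `retry_delay` is a hypothetical name, the 0.5 s base and the optional jitter are assumptions, and the function expects Retry-After already parsed into seconds (the header may also carry an HTTP date).

```python
import random
from typing import Optional

MAX_RETRY_AFTER_S = 5 * 60  # cap stated in the --retries description


def retry_delay(attempt: int, retry_after: Optional[float] = None,
                base: float = 0.5, jitter: float = 0.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if retry_after is not None:
        # Server-provided hint wins, but never wait longer than the cap.
        return min(max(retry_after, 0.0), MAX_RETRY_AFTER_S)
    backoff = base * 2 ** (attempt - 1)  # doubles with every attempt
    return backoff + random.uniform(0.0, jitter)
```

With the default of 4 retries, a caller would invoke this for attempts 1 through 4 after the initial request fails.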

Environment variables

Every flag has a PAGETOMD_ equivalent. For example:

PAGETOMD_TIMEOUT=60 PAGETOMD_FETCHER=auto pagetomd https://example.com

CLI flags always override env vars; env vars override the built-in defaults.
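The precedence rule is simple enough to state as code. The sketch below is illustrative only: `resolve_option` is a hypothetical helper, and deriving the variable name by upper-casing the flag and replacing dashes with underscores is an assumption extrapolated from PAGETOMD_TIMEOUT and PAGETOMD_FETCHER.

```python
import os
from typing import Mapping, Optional


def resolve_option(name: str, cli_value: Optional[str], default: str,
                   env: Mapping[str, str] = os.environ) -> str:
    """Resolve one option: CLI flag > PAGETOMD_* env var > built-in default."""
    if cli_value is not None:
        return cli_value
    env_key = "PAGETOMD_" + name.upper().replace("-", "_")
    return env.get(env_key, default)
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.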

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success. |
| 1 | Unexpected internal error. |
| 2 | Fetch failure (DNS, HTTP, robots.txt, redirect cap). |
| 3 | Extraction or conversion failure (empty body, malformed HTML). |
| 4 | Output write failure (permissions, disk, atomic-rename clash). |
| 5 | Missing optional dependency (e.g. playwright not installed). |
| 64 | Usage or configuration error (bad flag, invalid value). |
| 130 | Interrupted by user (Ctrl-C). |
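A wrapper script can turn these codes into named values. The numeric values below come from the table; the enum member names, the `describe` helper, and the policy in `worth_retrying` (treating only fetch failures as possibly transient) are assumptions for illustration.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0
    INTERNAL_ERROR = 1
    FETCH_FAILURE = 2
    EXTRACTION_FAILURE = 3
    WRITE_FAILURE = 4
    MISSING_DEPENDENCY = 5
    USAGE_ERROR = 64
    INTERRUPTED = 130


def worth_retrying(code: int) -> bool:
    """Wrapper policy (an assumption): only fetch failures may be transient."""
    return code == ExitCode.FETCH_FAILURE


def describe(code: int) -> str:
    """Human-readable label for a pagetomd exit code."""
    try:
        return ExitCode(code).name.lower().replace("_", " ")
    except ValueError:
        return f"unknown exit code {code}"
```

A caller would typically pass `subprocess.CompletedProcess.returncode` straight into `describe` for logging.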

How it works

The pipeline in one diagram and one paragraph:

URL ───► Fetcher ───► Extractor ───► Converter ───► Postprocess ───► Writer
         (httpx /     (BS4 clean     (markdownify   (NFC, heading    (atomic
         playwright)  + trafilatura) + GFM tables)  hierarchy,       file +
                                                    URL absolutise)  YAML)

The fetcher (httpx by default, playwright for SPAs) downloads the page with retries and robots.txt enforcement. The extractor runs a BeautifulSoup pre-clean pass (strip scripts/styles/nav/ads) then hands the cleaned tree to trafilatura to identify main content and harvest metadata. The converter renders the surviving HTML to Markdown via a customised markdownify subclass (ATX headings, fenced code blocks with language hints, GFM tables with wide-table fallbacks). The postprocessor enforces the AI-readiness contract (NFC, zero-width strip, monotonic heading hierarchy, absolute URLs). The writer prepends a YAML frontmatter block and writes atomically (or streams to stdout).
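The final stage, writing atomically, usually means writing to a temporary file in the destination directory and renaming it into place. The sketch below shows that general technique with the standard library; `write_atomically` is a hypothetical name, and this is not taken from pagetomd's writer (for instance, it does not implement the symlink checks described under Common options).

```python
import os
import tempfile
from pathlib import Path


def write_atomically(dest: Path, text: str, overwrite: bool = False) -> None:
    """Write text to dest via a temp file in the same directory plus rename."""
    if dest.exists() and not overwrite:
        raise FileExistsError(dest)
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=dest.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, dest)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # do not leave a partial temp file behind
        raise
```

Readers of `dest` see either the old file or the complete new one, never a half-written document.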

Security

pagetomd is a public-URL-only tool. It refuses to fetch private, loopback, link-local, multicast, reserved, or cloud-metadata addresses by default — and there is no flag to override that. Treat output files as having the same sensitivity as the URL they were fetched from.
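The address classes listed above map closely onto Python's `ipaddress` module. The following is a sketch of that kind of check, with `is_public_address` as a hypothetical name; it is not pagetomd's actual guard, and a real implementation would also need to resolve hostnames and check every resolved address, including after redirects.

```python
import ipaddress


def is_public_address(host_ip: str) -> bool:
    """True only for addresses outside the blocked classes listed above."""
    ip = ipaddress.ip_address(host_ip)
    blocked = (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # includes 169.254.169.254 (cloud metadata)
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
    return not blocked
```

Checking the parsed address rather than the URL string avoids being fooled by alternative spellings of the same IP.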

Quality gates

CI enforces both a project-wide test coverage floor of 85% and a per-module floor of 90% (line + branch combined) on the four critical modules — extractor, converter, writer, and postprocess. These four carry the AI-readiness contract, so they get the strictest coverage bar.

Contributing

git clone https://github.com/gs202/PageToMD.git
cd PageToMD
uv sync --extra dev --extra playwright
pre-commit install
uv run pytest

See CONTRIBUTING.md for the full contributor workflow.

License

Business Source License 1.1 — source-available, free for non-commercial use. Converts to MIT on 2030-06-16. See LICENSE for full terms.
