2026-06-17 · Rewritten on-site · 3 min read · Updated: 2026-06-17

The AI-native document format

DocLang is an open standard for machine-readable documents designed to be AI-native, preserving structure, semantics, and metadata for LLMs and VLMs. It aims to replace formats like PDF and DOCX that were built for rendering, not understanding.

Source: Hacker News AI · Author: taubek

DocLang

Open standard · Joint Development Foundation Project

The AI-native document format.

PDF was built for print. DOCX was built for editors. DocLang is built for what comes next — a machine-readable document standard your models can actually trust.



Your documents are lying to your models.

The world's knowledge lives in formats designed for rendering, not understanding. Markdown was built for readers. HTML for browsers. LaTeX for typesetting. PDF for print. None were built for machines.

Modern AI pipelines assume clean, structured input. Real-world documents (contracts, invoices, research papers, regulatory filings) are neither. Parsers guess at reading order. Tables become flat text. Figures vanish. Metadata is stripped.

The result: your model's accuracy is bottlenecked by document quality, not model quality. You spend more engineering time wrangling pre-processing than building the product.

parse("quarterly_report.pdf")

✕ reading_order
  expected: sequential hierarchy
  received: undefined

✕ table_structure
  expected: 3×12 grid with merged cells
  received: flat string (156 chars)

✕ figure_references
  expected: 8 embedded figures
  received: 0 (omitted)

✕ document_metadata
  expected: { author, created, lang }
  received: null
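The four failures above can be written down as plain checks over a parser's output. The following is a minimal sketch, assuming the parser returns a dictionary; the keys (`reading_order`, `tables`, `figures`, `metadata`) and the function `check_parse` are hypothetical names chosen for illustration and do not come from DocLang or any particular parser.

```python
from typing import Any


def check_parse(result: dict[str, Any], expected_figures: int) -> list[str]:
    """Return the names of the structural checks that the parser output fails."""
    failures = []

    # Reading order should be an explicit, non-empty sequence of element ids.
    if not result.get("reading_order"):
        failures.append("reading_order")

    # A table should be a grid (list of rows), not one flat string.
    tables = result.get("tables", [])
    if not tables or any(isinstance(t, str) for t in tables):
        failures.append("table_structure")

    # Every figure in the source should survive parsing.
    if len(result.get("figures", [])) != expected_figures:
        failures.append("figure_references")

    # Metadata should carry at least author, creation date, and language.
    meta = result.get("metadata") or {}
    if not {"author", "created", "lang"} <= set(meta):
        failures.append("document_metadata")

    return failures


# Hypothetical lossy output resembling the failures listed above.
lossy = {
    "reading_order": None,
    "tables": ["Quarter Revenue YoY Q3 2024 $42M +18%"],
    "figures": [],
    "metadata": None,
}
print(check_parse(lossy, expected_figures=8))
```

A check like this only detects the loss after the fact; it cannot recover structure the parser never emitted.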

A document representation built for how AI actually reads.

DocLang defines a structured, machine-readable format for documents of any type. Not a converter. Not an API. A standard — like JSON for data, like HTML for the web — that any tool can implement and any pipeline can consume.

Every component carries a semantic tag, bounding box coordinates, and reading order, natively encoded in a format LLM tokenizers can parse without translation overhead. A table encodes its full grid structure via OTSL. A heading carries its level and page position. Your model doesn't have to guess. Governance metadata (PII flags, RAG permissions, training constraints) lives inside the document itself, not in a sidecar file.
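To make "semantic tag, bounding box, and reading order" concrete, here is a minimal in-memory sketch. The class and field names (`Element`, `BBox`, `role`, `order`, `attrs`) are assumptions made for this example and are not taken from the DocLang spec; coordinate units are likewise assumed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box in page coordinates (units are an assumption)."""
    left: float
    top: float
    right: float
    bottom: float


@dataclass
class Element:
    """One document component; field names are illustrative only."""
    role: str    # semantic tag, e.g. "heading" or "table"
    page: int    # 1-based page number
    bbox: BBox   # where the component sits on the page
    order: int   # position in the encoded reading order
    text: str = ""
    attrs: dict = field(default_factory=dict)  # e.g. {"level": 1}


def in_reading_order(elements: list[Element]) -> list[Element]:
    """Sort by the encoded reading order instead of guessing from geometry."""
    return sorted(elements, key=lambda e: e.order)


elements = [
    Element("table", 1, BBox(72, 200, 540, 320), order=2, text="Revenue $42M"),
    Element("heading", 1, BBox(72, 90, 540, 120), order=1,
            text="Q3 2024", attrs={"level": 1}),
]
for e in in_reading_order(elements):
    print(e.order, e.role, e.text)
```

Because the order is stored on each element, a consumer never has to infer it from page coordinates.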

The same standard extends beyond text documents. Audio transcripts, images, and video segments encode as first-class elements — speakers, timestamps, and scenes using the same primitives as headings and tables.

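[Illustration: flattened parser output ("Q3 2024Financial Re port Net Revenue42 M51M39M Figure3.2 omitted author:null") shown beside the same content with its structure intact (author J. Smith, heading "Q3 2024", Revenue $42M). The markup is not preserved in this copy.]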

Six properties. No compromises.

AI-native

Every element maps directly to LLM tokens. No translation layers, no postprocessing, no structural guesswork.

Lossless

Tables keep their full grid structure. Figures keep their position. Reading order is preserved, not inferred.

Expressive

Semantic roles, bounding boxes, document hierarchy — all fully encoded. Your model stops hallucinating structure.

Beyond documents

Audio transcripts, images, video segments — same format, same primitives. Speakers, timestamps, and scenes are native elements.

Unambiguous

One canonical representation per content type. No parser-dependent variance. Every tool produces the same output.
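One way to test whether two tools really produce the same output is to canonicalize both serializations and compare them. The sketch below uses generic XML canonicalization (C14N 2.0) from the Python standard library, not a canonical form defined by DocLang; the element and attribute names in the sample strings are made up.

```python
from xml.etree.ElementTree import canonicalize

# Two hypothetical tool outputs: same content, but different attribute order,
# quoting style, and empty-element style. Element names are invented.
tool_a = '<doc><heading level="1" page="1">Q3 2024</heading><figure id="f1"/></doc>'
tool_b = "<doc><heading page='1' level='1'>Q3 2024</heading><figure id='f1'></figure></doc>"


def same_document(xml_a: str, xml_b: str) -> bool:
    """Compare two serializations after C14N 2.0 canonicalization."""
    return canonicalize(xml_a) == canonicalize(xml_b)


print(same_document(tool_a, tool_b))
```

Canonicalization removes purely syntactic differences only. Two tools that disagree on content or structure will still compare as different, which is the variance the property is meant to rule out.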

Open

A Joint Development Foundation Projects standard and LF AI & Data project. Public spec, open working group, no lock-in.

The business context layer for enterprise AI.

AI is only as reliable as the context it receives. DocLang transforms documents into structured business context that can be trusted across AI agents, workflows, and enterprise systems.

Business context, preserved

Structure alone is not enough. DocLang preserves the meaning, relationships, and business context behind your documents so AI systems can act on knowledge, not just content.

Fewer errors, faster decisions

Reliable structure means fewer errors in automated document workflows — fewer manual reviews, lower compliance exposure, and faster time-to-decision.

Audit-ready by default

Compliance metadata travels with the document, not alongside it. Legal and compliance teams define rules once, and every downstream system reads them automatically.
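As a rough illustration of metadata that travels with the document, a downstream system could read usage permissions directly from the file before indexing or training on it. This is a sketch under loud assumptions: the `<governance>` element and its attribute names are invented here and are not taken from the DocLang spec.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the <governance> element and its attributes
# are invented for this sketch, not taken from the DocLang spec.
SAMPLE = """
<doc>
  <governance contains_pii="true" allow_rag="true" allow_training="false"/>
  <heading level="1">Q3 Financial Summary</heading>
</doc>
"""


def allowed_uses(xml_text: str) -> dict[str, bool]:
    """Read usage permissions embedded in the document, denying by default."""
    root = ET.fromstring(xml_text.strip())
    gov = root.find("governance")
    attrs = gov.attrib if gov is not None else {}

    def flag(name: str, default: str) -> bool:
        return attrs.get(name, default).lower() == "true"

    return {
        # A missing PII flag is treated as "may contain PII".
        "contains_pii": flag("contains_pii", "true"),
        # Missing permissions are treated as "not allowed".
        "rag": flag("allow_rag", "false"),
        "training": flag("allow_training", "false"),
    }


print(allowed_uses(SAMPLE))
```

The deny-by-default choice is a design decision of this sketch: a document without governance metadata is handled as the most restricted case.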

No lock-in, ever

Swap components as the market evolves. Your documents stay portable because the specification is standardized and any vendor can implement it.

AI-native document format specification

DocLang is a constrained XML format built from the ground up for LLM tokenizers — a 1-to-1 mapping between DocLang tokens and model tokens, with minimal token count. Every component carries semantic role, geometric bounding box, and reading order. Tables use OTSL: 5 structural tokens where HTML needs 28.
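To show how a small token vocabulary can carry a full table grid, here is a minimal decoder sketch. It assumes the five-token vocabulary described in the published OTSL work: `C` (new cell), `L` (merge with the left neighbor), `U` (merge with the upper neighbor), `X` (merge with both), and `NL` (end of row). How DocLang spells these tokens is not shown on this page, so treat the names and the function `decode_otsl` as assumptions.

```python
def decode_otsl(tokens: list[str]) -> list[tuple[int, int, int, int]]:
    """Turn a flat OTSL token stream into (row, col, rowspan, colspan) cells."""
    # Split the stream into rows at each NL token.
    rows, current = [], []
    for tok in tokens:
        if tok == "NL":
            rows.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        rows.append(current)

    cells = []
    for r, row in enumerate(rows):
        for c, tok in enumerate(row):
            if tok != "C":
                continue  # L, U, and X only extend a cell that starts elsewhere
            # Count L tokens to the right to get the column span.
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == "L":
                colspan += 1
            # Count U tokens below to get the row span.
            rowspan = 1
            while (r + rowspan < len(rows)
                   and c < len(rows[r + rowspan])
                   and rows[r + rowspan][c] == "U"):
                rowspan += 1
            cells.append((r, c, rowspan, colspan))
    return cells


# A header cell spanning two columns plus one more header cell,
# followed by a row of three ordinary cells.
header_span = ["C", "L", "C", "NL", "C", "C", "C", "NL"]
print(decode_otsl(header_span))
```

Each token occupies exactly one grid position, so the grid shape can be read off the stream without matching open and close tags.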

Full spec and reference implementation on GitHub.

Q3 Financial Summary

Quarter   Revenue   YoY
Q3 2024   $42M      +18%

Is this just another document parser?

No. Parsers convert documents into some proprietary output format. DocLang is a standard — a shared specification that any parser, any converter, any AI tool can implement. The goal is interoperability, not another tool to integrate.

What's wrong with just using Docling / FineReader / MyParser instead of DocLang?

Nothing, and that's the point. Enable DocLang output in those tools and the structured data they produce stops being tied to your specific pipeline configuration. It becomes consumable by any downstream system that speaks the DocLang standard.

Docling and ABBYY FineReader Engine already natively support the DocLang standard.

How is this different from what PDF already contains?

PDF is a presentation format. It tells a renderer where to draw pixels. DocLang is a semantic format — it tells a model what content is. A PDF table and a DocLang table are fundamentally different objects.

Who governs the spec?

The DocLang Specification development process is governed by Joint Development Foundation Projects. DocLang is an LF AI & Data project. The DocLang working group — founded by IBM, NVIDIA, Red Hat, ABBYY, and HumanSignal — proposes and reviews changes, but the foundation ensures the process remains open and no single vendor controls the roadmap.

Can I contribute?

Yes. The spec, the reference implementation, and the working group processes are all public. Join the GitHub discussion, open an issue, or attend a working group session. The standard improves when more perspectives are in the room.

The substrate your pipeline has been missing.

If you build with LLMs and VLMs on real-world content, DocLang is the substrate you've been missing. Free, open, and ready to use — read the spec and see what your models have been working around.
