The AI-native document format
DocLang is an open standard for machine-readable documents designed to be AI-native, preserving structure, semantics, and metadata for LLMs and VLMs. It aims to replace formats like PDF and DOCX that were built for rendering, not understanding.
DocLang
Open standard · Joint Development Foundation Project
PDF was built for print. DOCX was built for editors. DocLang is built for what comes next — a machine-readable document standard your models can actually trust.
Your documents are lying to your models.
The world's knowledge lives in formats designed for rendering, not understanding. Markdown was built for readers. HTML for browsers. LaTeX for typesetting. PDF for print. None were built for machines.
Modern AI pipelines assume clean, structured input. Real-world documents (contracts, invoices, research papers, regulatory filings) are neither. Parsers guess at reading order. Tables become flat text. Figures vanish. Metadata is stripped.
The result: your model's accuracy is bottlenecked by document quality, not model quality. You spend more engineering time wrangling pre-processing than building the product.
parse("quarterly_report.pdf")
✕ reading_order
expected sequential hierarchy
received undefined
✕ table_structure
expected 3×12 grid with merged cells
received flat string (156 chars)
✕ figure_references
expected 8 embedded figures
received 0 (omitted)
✕ document_metadata
expected { author, created, lang }
received null
A document representation built for how AI actually reads.
DocLang defines a structured, machine-readable format for documents of any type. Not a converter. Not an API. A standard — like JSON for data, like HTML for the web — that any tool can implement and any pipeline can consume.
Every component carries a semantic tag, bounding box coordinates, and reading order, natively encoded in a format LLM tokenizers can parse without translation overhead. A table encodes its full grid structure via OTSL. A heading carries its level and page position. Your model doesn't have to guess. Governance metadata (PII flags, RAG permissions, training constraints) lives inside the document, not in a sidecar file.
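To make the value of an explicit reading order concrete, here is a minimal Python sketch. It is not the DocLang data model: the Element class and its field names (tag, bbox, order, text) are hypothetical stand-ins for "semantic tag, bounding box, and reading order", and the page is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One document component. All field names here are hypothetical."""
    tag: str                                 # semantic role, e.g. "heading"
    bbox: tuple[float, float, float, float]  # (left, top, right, bottom)
    order: int                               # explicit reading-order index
    text: str = ""


def in_reading_order(elements):
    """Sort by the encoded reading order instead of guessing from geometry."""
    return sorted(elements, key=lambda e: e.order)


# An invented two-column page. The right column starts slightly higher than
# the left one, which is enough to fool a naive top-to-bottom heuristic.
page = [
    Element("paragraph", (320, 90, 580, 200), order=2, text="Right column."),
    Element("heading", (40, 40, 580, 80), order=0, text="Q3 Financial Summary"),
    Element("paragraph", (40, 100, 300, 400), order=1, text="Left column."),
]

# What a geometry-only parser might do: sort by the top edge of each box.
naive_text = [e.text for e in sorted(page, key=lambda e: e.bbox[1])]

# What a consumer can do when reading order is part of the format.
ordered_text = [e.text for e in in_reading_order(page)]
```

In this sketch the geometry-only sort reads the right column before the left one, while the explicit index returns heading, left column, right column. The bounding boxes remain available for layout-aware models that want them.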
The same standard extends beyond text documents. Audio transcripts, images, and video segments are encoded as first-class elements, with speakers, timestamps, and scenes expressed using the same primitives as headings and tables.
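[Illustration: the same page as a typical parser emits it ("Q3 2024Financial Re port Net Revenue42 M51M39M", Figure 3.2 omitted, author: null) and as DocLang encodes it (author J. Smith, heading Q3 2024, Revenue $42M).]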
Six properties. No compromises.
AI-native
Every element maps directly to LLM tokens. No translation layers, no postprocessing, no structural guesswork.
Lossless
Tables keep their full grid structure. Figures keep their position. Reading order is preserved, not inferred.
Expressive
Semantic roles, bounding boxes, document hierarchy — all fully encoded. Your model stops hallucinating structure.
Beyond documents
Audio transcripts, images, video segments — same format, same primitives. Speakers, timestamps, and scenes are native elements.
Unambiguous
One canonical representation per content type. No parser-dependent variance. Every tool produces the same output.
Open
A Joint Development Foundation Projects standard and an LF AI & Data project. Public spec, open working group, no lock-in.
The business context layer for enterprise AI.
AI is only as reliable as the context it receives. DocLang transforms documents into structured business context that can be trusted across AI agents, workflows, and enterprise systems.
Business context, preserved
Structure alone is not enough. DocLang preserves the meaning, relationships, and business context behind your documents so AI systems can act on knowledge, not just content.
Fewer errors, faster decisions
Reliable structure means fewer errors in automated document workflows — fewer manual reviews, lower compliance exposure, and faster time-to-decision.
Audit-ready by default
Compliance metadata travels with the document, not alongside it. Legal and compliance teams define rules once, and every downstream system reads them automatically.
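One way a downstream system could honor metadata that travels with the document is a simple gate in front of each pipeline. The Python sketch below assumes a hypothetical Governance record with contains_pii, allow_rag, and allow_training fields; these names, the Doc class, and the policy of excluding PII-flagged documents from training are illustrative assumptions, not taken from the DocLang spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Governance:
    """Hypothetical governance block; field names are illustrative only."""
    contains_pii: bool = False
    allow_rag: bool = True
    allow_training: bool = True


@dataclass(frozen=True)
class Doc:
    doc_id: str
    governance: Governance


def allowed_for(docs, purpose):
    """Keep only documents whose embedded metadata permits the purpose."""
    if purpose == "rag":
        return [d for d in docs if d.governance.allow_rag]
    if purpose == "training":
        # Assumed policy for this sketch: PII-flagged documents are never
        # used for training, whatever the allow_training flag says.
        return [
            d for d in docs
            if d.governance.allow_training and not d.governance.contains_pii
        ]
    raise ValueError(f"unknown purpose: {purpose!r}")


# Invented corpus for illustration.
corpus = [
    Doc("invoice-001", Governance(contains_pii=True)),
    Doc("whitepaper-7", Governance()),
    Doc("contract-42", Governance(allow_rag=False, allow_training=False)),
]

rag_ids = [d.doc_id for d in allowed_for(corpus, "rag")]
training_ids = [d.doc_id for d in allowed_for(corpus, "training")]
```

Because the rules sit in the document record itself, the retrieval index and the training job can apply the same function without consulting a separate policy store.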
No lock-in, ever
Swap components as the market evolves. Your documents stay portable because the specification is standardized and any vendor can implement it.
AI-native document format specification
DocLang is a constrained XML format built from the ground up for LLM tokenizers — a 1-to-1 mapping between DocLang tokens and model tokens, with minimal token count. Every component carries semantic role, geometric bounding box, and reading order. Tables use OTSL: 5 structural tokens where HTML needs 28.
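To show how five structural tokens can carry a full grid with merged cells, here is a minimal Python sketch of an OTSL decoder. It assumes the token set from the published OTSL description: C (new cell), L (merge with the cell to the left), U (merge with the cell above), X (merge with both), and NL (end of row). How DocLang spells these tokens inside its XML is not shown here, so the plain strings below are an assumption, and otsl_to_spans is a hypothetical helper, not part of any reference implementation.

```python
def otsl_to_spans(tokens):
    """Turn a flat OTSL token sequence into (row, col, rowspan, colspan)."""
    rows, current = [], []
    for tok in tokens:
        if tok == "NL":
            rows.append(current)
            current = []
        elif tok in {"C", "L", "U", "X"}:
            current.append(tok)
        else:
            raise ValueError(f"unknown OTSL token: {tok!r}")
    if current:
        rows.append(current)  # tolerate a missing final NL

    # OTSL describes a rectangular grid, so every row has the same width.
    if len({len(r) for r in rows}) > 1:
        raise ValueError("OTSL rows must all have the same width")

    cells = []
    for r, row in enumerate(rows):
        for c, tok in enumerate(row):
            if tok != "C":
                continue  # L, U and X only extend a cell that starts elsewhere
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == "L":
                colspan += 1
            rowspan = 1
            while r + rowspan < len(rows) and rows[r + rowspan][c] == "U":
                rowspan += 1
            cells.append((r, c, rowspan, colspan))
    return cells


# A 3x3 grid: the top-left cell spans two columns, and the first cell of
# the second row spans two rows.
tokens = ["C", "L", "C", "NL",
          "C", "C", "C", "NL",
          "U", "C", "C", "NL"]
cells = otsl_to_spans(tokens)
```

Each C token starts a cell, and its span is recovered by counting the L tokens to its right and the U tokens below it. The grid shape is fixed by the token positions alone, with no separate span attributes to parse or reconcile.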
The full spec and reference implementation are on GitHub.
Example: Q3 Financial Summary
Quarter    Revenue    YoY
Q3 2024    $42M       +18%
Is this just another document parser?
No. Parsers convert documents into some proprietary output format. DocLang is a standard — a shared specification that any parser, any converter, any AI tool can implement. The goal is interoperability, not another tool to integrate.
What's wrong with just using Docling / FineReader / MyParser instead of DocLang?
Nothing — that's the point. Enable the DocLang output and the structured data they produce stops being tied to your specific pipeline configuration. It becomes consumable by any downstream system that speaks the DocLang standard.
Docling and ABBYY FineReader Engine already natively support the DocLang standard.
How is this different from what PDF already contains?
PDF is a presentation format. It tells a renderer where to draw pixels. DocLang is a semantic format — it tells a model what content is. A PDF table and a DocLang table are fundamentally different objects.
Who governs the spec?
The DocLang Specification development process is governed by Joint Development Foundation Projects. DocLang is an LF AI & Data project. The DocLang working group — founded by IBM, NVIDIA, Red Hat, ABBYY, and HumanSignal — proposes and reviews changes, but the foundation ensures the process remains open and no single vendor controls the roadmap.
Can I contribute?
Yes. The spec, the reference implementation, and the working group processes are all public. Join the GitHub discussion, open an issue, or attend a working group session. The standard improves when more perspectives are in the room.
The substrate your pipeline has been missing.
If you build with LLMs and VLMs on real-world content, DocLang is the substrate you've been missing. Free, open, and ready to use — read the spec and see what your models have been working around.