2026-05-20 | Updated: 2026-06-12

Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

This paper presents a microservice architecture that encapsulates pipelines for classification, OCR, and LLM-based structured field extraction, sharing production experience handling thousands of multi-page documents per hour. Key design decisions include hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, asynchronous processing, and independent horizontal scaling. Batch profiling reveals that OCR dominates end-to-end latency, and system saturation is determined by shared GPU-inference capacity rather than worker count.

Source: arXiv AI

Authors: Yao Fehlis, Benjamin Bengfort, Zhangzhang Si, Vahid Eyorokon, Prema Roman, Patrick Deziel, Devon Slonaker, Steve Veldman, Ben Johnson, Joyce Rigelo, Michael Wharton, Steve Kramer

[Submitted on 12 May 2026]

Abstract: Academic research tends to focus on new models for document understanding, creating a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model (LLM) structured field extraction, and we report our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many IO-bound operations in the pipeline, and an independent horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark, effectively operationalizing models in production.
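The saturation finding in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes a hypothetical pool of asynchronous CPU-side workers that share a fixed number of GPU inference slots, modelled here with an `asyncio.Semaphore`. The function name `run_pipeline`, the slot count, and the stage durations are all invented for illustration; the only thing taken from the abstract is the qualitative shape (OCR slower than LLM parsing, and inference capacity shared across workers).

```python
import asyncio


async def run_pipeline(num_docs: int, num_workers: int, gpu_slots: int) -> dict:
    """Simulate orchestration workers that share a fixed pool of GPU slots."""
    # Hypothetical stand-in for a shared GPU inference service.
    gpu = asyncio.Semaphore(gpu_slots)

    queue: asyncio.Queue = asyncio.Queue()
    for doc_id in range(num_docs):
        queue.put_nowait(doc_id)

    stats = {"in_flight": 0, "peak_in_flight": 0, "processed": 0}

    async def infer(seconds: float) -> None:
        # Only `gpu_slots` inference calls can run at once, however many
        # workers are waiting on this semaphore.
        async with gpu:
            stats["in_flight"] += 1
            stats["peak_in_flight"] = max(stats["peak_in_flight"], stats["in_flight"])
            await asyncio.sleep(seconds)  # stands in for a model call
            stats["in_flight"] -= 1

    async def worker() -> None:
        while True:
            try:
                queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            await infer(0.02)   # assumed OCR stage duration (illustrative)
            await infer(0.005)  # assumed LLM extraction duration (illustrative)
            stats["processed"] += 1

    await asyncio.gather(*(worker() for _ in range(num_workers)))
    return stats


if __name__ == "__main__":
    # Eight workers, two GPU slots: concurrency at the GPU never exceeds two.
    print(asyncio.run(run_pipeline(num_docs=20, num_workers=8, gpu_slots=2)))
```

In this toy model, raising `num_workers` beyond `gpu_slots` does not raise the peak number of concurrent inference calls, which is the kind of behaviour the abstract describes as saturation determined by shared GPU-inference capacity rather than worker count.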

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

Cite as: arXiv:2605.18818 [cs.AI]

(or arXiv:2605.18818v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2605.18818

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yao Fehlis

[v1] Tue, 12 May 2026 13:07:34 UTC (20 KB)
