The article discusses how AI agents fail because they cannot access the majority of unstructured data within organizations. It introduces Unstructured as a platform that processes over 65 file types and extracts, chunks, enriches, and embeds their data into Databricks lakehouses, giving agents access to full business context while maintaining governance through Unity Catalog.
80% of organizational knowledge is locked in unstructured data.
Unstructured provides a single pipeline for extraction, chunking, enrichment, and embedding.
The Naval Sea Systems Command awarded Unstructured a contract to design an AI-enabled solution that helps warfighters surface mission-critical information faster, reduce operator workload, and accelerate decision-making in Anti-Submarine Warfare and surface warfare operations. The solution integrates Unstructured's data ingestion with Elastic's enterprise search to mine heterogeneous data sources, initially deployed on CV-TSC and USW-DSS systems with future applicability to JADC2 and C5ISR.
NAVSEA awarded Unstructured a contract to develop an AI solution that integrates fragmented data to accelerate fleet decisions.
The solution combines Unstructured's data ingestion and Elastic's enterprise search to support Anti-Submarine Warfare (ASW) and surface warfare.
Unstructured introduces Extract, a new enrichment node that extracts structured JSON data from documents using LLM or regex, enabling intelligent document processing within existing workflows.
The Extract node lets users define a schema for extracting structured records from documents. It supports LLM-based extraction for tasks that require understanding and regex-based extraction for pattern-matching tasks.
It runs within the existing Unstructured workflow, producing DocumentData elements while preserving other nodes' outputs.
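The regex-based half of schema-driven extraction can be illustrated with a minimal sketch. It uses only the Python standard library and is not Unstructured's API: the schema format, the field names, the `extract_record` helper, and the sample invoice text are all hypothetical.

```python
import json
import re
from typing import Dict, Optional

# Hypothetical schema: field name -> regex with exactly one capture group.
# This is an illustrative format, not the Extract node's schema format.
INVOICE_SCHEMA: Dict[str, str] = {
    "invoice_number": r"Invoice\s*#\s*([A-Z0-9-]+)",
    "total": r"Total:\s*\$([0-9]+(?:\.[0-9]{2})?)",
    "due_date": r"Due:\s*(\d{4}-\d{2}-\d{2})",
}


def extract_record(text: str, schema: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Build one record; a field whose pattern does not match is set to None."""
    record: Dict[str, Optional[str]] = {}
    for field, pattern in schema.items():
        match = re.search(pattern, text)
        record[field] = match.group(1) if match else None
    return record


sample = "Invoice # INV-2041\nTotal: $1280.50\nDue: 2025-03-31"
record = extract_record(sample, INVOICE_SCHEMA)
print(json.dumps(record, indent=2))
```

Patterns like these suit fields with a fixed surface form (identifiers, amounts, dates). Fields that depend on reading comprehension are the case the summary assigns to LLM-based extraction.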
Unstructured launches webhooks to automate downstream actions based on job lifecycle events, allowing integration with any endpoint via workspace or workflow scopes.
Webhooks fire on five job events: scheduled, in_progress, stopped, failed, completed.
Two scopes: workspace-scoped for all jobs, workflow-scoped for specific workflows.
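On the receiving side, a webhook endpoint typically routes each delivery by event name. The sketch below assumes a JSON payload with `event` and `job_id` keys; those field names, the handlers, and the `dispatch` function are hypothetical and do not describe Unstructured's actual payload format. Only the five event names come from the announcement.

```python
from typing import Callable, Dict

# The five job lifecycle events named in the announcement.
JOB_EVENTS = ("scheduled", "in_progress", "stopped", "failed", "completed")

Handler = Callable[[dict], str]


def on_completed(payload: dict) -> str:
    # Hypothetical downstream action for a finished job.
    return f"index results of job {payload['job_id']}"


def on_failed(payload: dict) -> str:
    # Hypothetical downstream action for a failed job.
    return f"alert on-call about job {payload['job_id']}"


HANDLERS: Dict[str, Handler] = {"completed": on_completed, "failed": on_failed}


def dispatch(payload: dict) -> str:
    """Route one webhook delivery to a handler based on its event name."""
    event = payload.get("event")  # assumed field name
    if event not in JOB_EVENTS:
        raise ValueError(f"unknown job event: {event!r}")
    handler = HANDLERS.get(event)
    # Known events without a registered handler are acknowledged and ignored.
    return handler(payload) if handler else f"ignored {event}"


print(dispatch({"event": "completed", "job_id": "job-123"}))
```

Scope is not modeled here: whether a delivery comes from a workspace-scoped or workflow-scoped webhook is decided where the webhook is registered, so the receiver only needs to handle the events it is sent.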
Unstructured found that combining high-quality datasets with incompatible annotation styles degraded model performance. They built an agentic harmonization pipeline using a VLM to reconcile label differences, resulting in improved metrics across 14 of 17 benchmarks.
Annotation inconsistency in training data can cause model performance to degrade even when data volume increases.
Unstructured developed an agentic label harmonization workflow using a VLM to reconcile conflicting annotations before training.
Unstructured's SCORE-Bench benchmark evaluates five frontier models on enterprise document parsing and reveals a significant gap between raw model calls and optimized pipelines. The models excel at reasoning and hallucination control (Claude Opus 4.6 in particular), but they trail optimized pipelines by up to 23 percentage points on table extraction, document structure, and output consistency. The gap is attributed to configuration rather than capability and can be closed through optimized prompting, post-processing, and output structure enforcement.
Claude Opus 4.6's hallucination rate (0.044) nearly matches the pipeline's (0.043), but its recall is the lowest (0.737), meaning it misses roughly a quarter of the content.
All models score up to 23 percentage points lower than the optimized pipeline on table extraction, risking structurally misplaced data.
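The arithmetic behind the recall figure can be checked in a few lines. This assumes recall is the fraction of reference content the model recovers, which is the conventional definition; the summary does not spell out how SCORE-Bench computes it.

```python
# Figures quoted above for Claude Opus 4.6 and the optimized pipeline.
model_recall = 0.737
model_hallucination = 0.044
pipeline_hallucination = 0.043

# Assumption: recall = recovered content / reference content,
# so the missed share is its complement.
missed_share = round(1 - model_recall, 3)
hallucination_gap = round(model_hallucination - pipeline_hallucination, 3)

print(f"missed content: {missed_share:.1%}")
print(f"hallucination gap: {hallucination_gap:.3f}")
```

Under that assumption the missed share is 26.3%, in line with the approximate figure quoted, while the hallucination rates differ by only 0.001.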
Unstructured's new guide covers advanced Retrieval-Augmented Generation (RAG) techniques, including smart chunking, metadata filtering, GraphRAG, hybrid search, and agentic workflows, and is aimed at building scalable enterprise AI pipelines.
The guide explains why naive RAG fails and how to fix it.
It covers smart chunking strategies (by title, by similarity, and structure-aware).
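Chunking by title can be sketched in plain Python as follows. This is an illustration of the idea only, not the guide's code or the Unstructured library's implementation: the element shape (`type` and `text` keys), the `group_by_title` name, and the `max_chars` default are assumptions.

```python
from typing import Dict, List

Element = Dict[str, str]  # hypothetical shape: {"type": ..., "text": ...}


def group_by_title(elements: List[Element], max_chars: int = 500) -> List[str]:
    """Start a new chunk at every title, and whenever max_chars would be exceeded.

    Size counts text characters only (not the joining newlines), and a single
    element longer than max_chars is kept whole rather than split.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for el in elements:
        text = el["text"]
        starts_section = el["type"] == "Title"
        too_big = size + len(text) > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks


doc = [
    {"type": "Title", "text": "Setup"},
    {"type": "NarrativeText", "text": "Install the package."},
    {"type": "Title", "text": "Usage"},
    {"type": "NarrativeText", "text": "Run the pipeline."},
]
print(group_by_title(doc))
```

Splitting at titles keeps each chunk within one section, so a retrieved chunk does not mix text from unrelated headings. Fixed-size splitting gives no such guarantee.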
Unstructured announces major updates including a simplified drag-and-drop interface, Generative Refinement for higher fidelity outputs, and simplified pricing with a free tier. The new workflow combines high-resolution partitioning with VLM-powered enrichments to achieve superior accuracy and structure preservation.
The new Start-page drag-and-drop interface lets users process documents in three clicks, with visual previews and bounding boxes.
Generative Refinement uses VLM post-processing to improve OCR, tables, and images, reducing hallucinations.