2026-06-08 · Rewritten on-site · 3 min read · Updated: 2026-06-08

Why Do LLMs Corrupt Your Documents When You Delegate?

A recent study reveals that delegating tasks to LLMs can silently corrupt documents. The DELEGATE-52 benchmark tested 19 models and found that even top models corrupt 25% of content after 20 interactions. Causes include compounding errors, deletion by weak models vs. hallucination by strong ones, context overload, and domain unfamiliarity. Agentic AI tools offer little remediation.

Source: KDnuggets · Author: Iván Palomares Carrascosa


Corruption with Delegation

We are entering a new AI era in which interaction turns into work delegation. Users no longer just chat with an AI that answers their questions: they increasingly delegate long-horizon tasks, from editing source code to formatting professional text or even managing accounting books. In doing so, they place an unprecedented level of trust in AI systems to maintain the integrity of files such as documents across multiple interactions.

However, a recent study revealed a problem: when you delegate tasks to a large language model (LLM), it may silently corrupt the documents you hand to it. To understand this issue, the researchers behind the study, whose findings we summarize here, built a rigorous evaluation framework called "DELEGATE-52". The benchmark spans 52 professional domains, from legal text to Python coding, music notation, and crystallography.

The authors tested 19 distinct LLMs using a simulation method based on a "round-trip" approach: the AI is asked to perform a specific edit, followed by the exact inverse instruction to undo it. In an ideal scenario, the model would return the original document totally intact. The reality check: even the strongest models, like Gemini Pro, Claude Opus, and GPT-5, corrupt 25% of the original document content after 20 interactions; weaker models can approach 50%.
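To make the round-trip idea concrete, here is a minimal Python sketch. It is not the benchmark's actual harness or metric: the `Editor` signature, the word-level scoring, and the `lossy_toy_editor` stand-in (used instead of a real LLM call) are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical interface: (document, instruction) -> edited document.
Editor = Callable[[str, str], str]


def corruption_after_round_trips(
    original: str,
    editor: Editor,
    forward: str,
    inverse: str,
    round_trips: int,
) -> float:
    """Fraction of the original's words that are not recovered after the round trips."""
    document = original
    for _ in range(round_trips):
        document = editor(document, forward)  # apply the edit
        document = editor(document, inverse)  # ask for the exact undo
    original_words = original.split()
    if not original_words:
        return 0.0
    matcher = SequenceMatcher(None, original_words, document.split(), autojunk=False)
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(original_words)


def lossy_toy_editor(document: str, instruction: str) -> str:
    """Stand-in for a model call (assumption): changes case, but drops the last word on 'upper'."""
    words = document.split()
    if instruction == "upper":
        return " ".join(word.upper() for word in words[:-1])
    return " ".join(word.lower() for word in words)


text = "the quick brown fox jumps over the lazy dog again"
score = corruption_after_round_trips(text, lossy_toy_editor, "upper", "lower", 2)
print(f"corrupted fraction: {score:.2f}")
```

A perfectly reversible editor would score 0.0 here; the toy editor loses one word per round trip, so the score grows with every additional trip.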

Why Models Corrupt Your Documents

Why does this structural content decay happen? The researchers uncovered several reasons:

// 1. Errors Compound

Just like in the traditional "telephone game", small errors made by LLMs can quietly compound until they become significant. A single edit may add only a few sparse, localized errors, but a long sequence of complex edits snowballs them into drastic document degradation.
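A back-of-the-envelope calculation shows how quickly small losses add up. The sketch below assumes every interaction independently preserves the same fraction of the content; that simple multiplicative model is our assumption for illustration, not something the study claims, and the function names are hypothetical.

```python
def retained_after(steps: int, per_step_retention: float) -> float:
    """Fraction of content intact after `steps` edits, assuming uniform, independent loss."""
    return per_step_retention ** steps


def per_step_retention_for(final_retention: float, steps: int) -> float:
    """Per-edit retention that compounds to `final_retention` after `steps` edits."""
    return final_retention ** (1.0 / steps)


# 25% corruption after 20 interactions means 75% of the content is retained overall.
rate = per_step_retention_for(0.75, 20)
print(f"implied per-interaction retention: {rate:.4f}")
print(f"retained after 20 interactions at 99% each: {retained_after(20, 0.99):.3f}")
```

Under this assumption, a loss of only about 1.4% per interaction is enough to reach 25% corruption after 20 interactions, which is why each individual edit can look harmless.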

// 2. Weak Models Delete, Smart Ones Hallucinate

The study highlights a striking shift in how different types of models fail. Weaker models tend toward deletion: they accidentally drop content, which becomes noticeable after several interactions because the document visibly shrinks. In frontier LLMs, however, the root issue is not deletion but corruption: they keep the document's overall "look and feel", even maintaining a nearly intact word count, but they silently mistype, modify, or replace factual information with fabrications that still sound plausible. Here's the irony: the smarter the model, the harder its corruptive behavior is to detect, because the final output still looks legitimate at first glance.
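The two failure modes leave different fingerprints in a diff, which the following sketch tries to illustrate. It uses a word-level comparison from Python's standard library; the `drift_profile` function, its categories, and the invoice example are hypothetical and are not taken from the study.

```python
from difflib import SequenceMatcher


def drift_profile(original: str, edited: str) -> dict[str, float]:
    """Split word-level drift into deleted, substituted, and inserted shares of the original."""
    a, b = original.split(), edited.split()
    counts = {"deleted": 0, "substituted": 0, "inserted": 0}
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            counts["deleted"] += i2 - i1
        elif tag == "replace":
            counts["substituted"] += i2 - i1
        elif tag == "insert":
            counts["inserted"] += j2 - j1
    total = max(len(a), 1)  # avoid division by zero on an empty original
    profile = {name: count / total for name, count in counts.items()}
    profile["length_ratio"] = len(b) / total
    return profile


source = "Invoice 1042 totals 3,450 EUR and is due on 14 March"
shrunk = "Invoice 1042 totals 3,450 EUR"                          # deletion-style failure
plausible = "Invoice 1042 totals 3,540 EUR and is due on 24 March"  # substitution-style failure

print(drift_profile(source, shrunk))
print(drift_profile(source, plausible))
```

The shrunk version is easy to flag because its length ratio falls well below 1. The plausible version keeps a length ratio of exactly 1 and only shows up in the substituted share, so a word-count check alone would miss it.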

// 3. Context Overload and Distractor Attachments

In messy conditions, with a lot of context or many attached documents, models struggle to keep information structurally intact. As the document grows or more "distractor files" are included in the prompt context, the severity of degradation skyrockets: the model loses its grip on accurate details and fills the gaps with predictions. It no longer adheres to the source text, because it finds it easier to just guess.

// 4. The Importance of Domain Familiarity

A final factor is the nature of the use case and how familiar the model is with it.

Not all files degrade to the same extent in delegation-based tasks. According to the study, LLMs perform well in highly structured, programmatic domains such as Python source code. When pushed into purely natural-language tasks or niche spatial formatting, they quickly lose the strict sense of internal logic needed to keep files intact.

Does Agentic AI Help?

Even when LLMs are equipped with agentic tools, such as the ability to execute code or directly read and write files, delegation-based document corruption does not fade. In fact, agentic add-ons do little to nothing to prevent an issue that takes place at the core of the transformer architecture underlying LLMs. The way long-horizon AI tasks are verified needs rethinking. Until then, using LLMs as fully unsupervised document editors remains a high-risk gamble.

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
