ByteDance study finds that asking LMMs questions beats making them transcribe text for long-document training
A study from ByteDance Seed and HKUST shows that training multimodal models with question-answer pairs is far more effective than using text transcription for long document understanding. Their model MMProLong, based on Qwen2.5-VL, outperforms much larger models and remains stable up to 512K tokens. Key findings include that pure OCR training hurts performance, diversity in training lengths matters, and short examples are not necessary.
Article intelligence
Key points
- Question-answer training significantly improves long-document performance, while pure OCR training degrades it.
- MMProLong, trained with contexts of only up to 128K tokens, remains stable at 512K-token inputs and outperforms larger models.
- Diversity in document length matters more than focusing on the longest contexts; short examples are not necessary.
- The ability transfers to untrained tasks like long-video understanding.
Multimodal AI models are supposed to handle ever-longer documents, but how they're trained to do so usually stays a trade secret. A new study shows that character recognition as a training task actually hurts performance and that question-answer pairs work far better.
Researchers from ByteDance Seed and the Hong Kong University of Science and Technology (HKUST) studied how image-language models can be trained efficiently on long documents. The result is a model called MMProLong, built on Alibaba's open Qwen2.5-VL, that beats much larger competitors.
Modern multimodal AI models need to handle increasingly long inputs: entire PDF collections of rendered pages, hours of video, or agents that remember their tasks across many steps. AI labs like OpenAI, Google, and Alibaba tout context windows of up to 1 million tokens, capable of holding not just text but thousands of page images or video frames. But according to the authors, technical reports barely reveal what data a model should see and in what mix.
Asking questions teaches more than transcribing text
At first glance, the study's central finding seems obvious. For a multimodal model to learn to find the right spot in a 100-page document, having it transcribe the text of every page barely helps. It's more effective to ask questions whose answers are buried somewhere in those pages.
The synthesis pipeline combines OCR parsing, automatic question generation, and re-embedding to extract long-context training examples from real documents. | Image: ByteDance
The researchers tested both approaches head-to-head. In one setup, the model had to perform text recognition either across all pages of a document or for a few selected pages, while the remaining pages stayed in context as distractors.
In the other setup, the researchers used a separate model (Seed 2.0 from ByteDance) to generate question-answer pairs for individual sections of a document. The question then went into training alongside the entire document, forcing the model to locate the relevant passage within a long context.
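The difference between the two setups can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `TrainingExample` class, both builder functions, and the `toy_generate_qa` stub (standing in for the separate question-generating model) are hypothetical names, and plain strings stand in for rendered page images.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class TrainingExample:
    context_pages: List[str]  # every page of the document (page images in practice)
    prompt: str               # what the model is asked to do
    target: str               # what the model should output
    source_pages: List[int]   # pages the target was derived from


def build_qa_example(
    pages: Sequence[str],
    start: int,
    end: int,
    generate_qa: Callable[[Sequence[str]], Tuple[str, str]],
) -> TrainingExample:
    """Generate a question from one section, then pair it with the WHOLE document."""
    question, answer = generate_qa(pages[start:end])
    return TrainingExample(
        context_pages=list(pages),  # full document stays in context
        prompt=question,
        target=answer,
        source_pages=list(range(start, end)),
    )


def build_ocr_example(pages: Sequence[str], page_ids: Sequence[int]) -> TrainingExample:
    """Transcription variant: the target is simply the text of the selected pages."""
    numbers = ", ".join(str(i + 1) for i in page_ids)
    return TrainingExample(
        context_pages=list(pages),
        prompt=f"Transcribe the text on pages {numbers}.",
        target="\n".join(pages[i] for i in page_ids),
        source_pages=list(page_ids),
    )


def toy_generate_qa(section: Sequence[str]) -> Tuple[str, str]:
    # Hypothetical stub; a real pipeline would call a question-generating model here.
    return "What does the first page of this section state?", section[0]
```

In the question-answer variant the prompt does not say where the answer is, so the model has to search all of `context_pages`; in the transcription variant the prompt names the pages directly.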
Question-answer training (top rows) sharply improves the model's long-document performance, while pure character recognition training (bottom rows) actually makes it worse. Even with extra fine-tuning, the OCR variants don't catch up. | Image: ByteDance
Pure text recognition as a training task actually worsened performance compared to the starting point. Question-answer training, on the other hand, brought clear gains. The model only learns to navigate long texts when it has to filter out and categorize information with a specific goal.
Diversity beats specialization
Three more findings turned up in the experiments. Feeding the model mainly very long documents at the top end of the context window isn't worth it. A broader mix of shorter and longer examples works more reliably. Long-context ability isn't a skill tied to a specific length but requires flexible searching across different distances.
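To make the length-mix idea concrete, here is a small sketch that draws target context lengths for training examples from weighted buckets. The bucket sizes and weights are illustrative assumptions, not values reported by the study, and `sample_context_lengths` is a hypothetical helper.

```python
import random
from typing import List, Sequence


def sample_context_lengths(
    n: int,
    buckets: Sequence[int],
    weights: Sequence[float],
    seed: int = 0,
) -> List[int]:
    """Draw n target context lengths (in tokens) from weighted length buckets."""
    if len(buckets) != len(weights):
        raise ValueError("buckets and weights must have the same length")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    return rng.choices(list(buckets), weights=list(weights), k=n)


# Assumed buckets for illustration only.
BUCKETS = [8_000, 16_000, 32_000, 64_000, 128_000]

# Broad mix: every length range is represented.
diverse = sample_context_lengths(1000, BUCKETS, [1, 1, 1, 1, 1])

# Top-heavy mix: almost everything sits near the maximum context length.
top_heavy = sample_context_lengths(1000, BUCKETS, [0, 0, 0, 1, 9])
```

The study's finding corresponds to preferring something like `diverse` over `top_heavy` when assembling training data.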
The real bottleneck also turns out to be finding the relevant passage, not reasoning about it. A mix weighted toward extraction tasks with a smaller share of calculation tasks delivered the best results.
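A task mix weighted toward extraction could be assembled roughly as follows. The article does not give the exact ratio, so the 80/20 default is purely an assumption for illustration, and `mix_tasks` is a hypothetical helper.

```python
import random
from typing import List, Sequence


def mix_tasks(
    extraction: Sequence[str],
    calculation: Sequence[str],
    total: int,
    extraction_share: float = 0.8,  # assumed ratio, not taken from the study
    seed: int = 0,
) -> List[str]:
    """Build a shuffled training mix with a fixed share of extraction examples."""
    if not 0.0 <= extraction_share <= 1.0:
        raise ValueError("extraction_share must be between 0 and 1")
    n_extraction = round(total * extraction_share)
    n_calculation = total - n_extraction
    if n_extraction > len(extraction) or n_calculation > len(calculation):
        raise ValueError("not enough examples to build the requested mix")
    rng = random.Random(seed)
    # Sample without replacement from each pool, then shuffle the combined list.
    mixed = rng.sample(list(extraction), n_extraction)
    mixed += rng.sample(list(calculation), n_calculation)
    rng.shuffle(mixed)
    return mixed
```

Calling `mix_tasks(extraction_pool, calculation_pool, total=50)` with the default share yields 40 extraction examples and 10 calculation examples in random order.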
The third finding is surprising because it contradicts common practice with text-only language models. Adding short training examples doesn't appear strictly necessary. The model largely kept its short-task abilities even when trained only on long question-answer data. The format of the data itself probably helps: even when the context is very long, the task is still framed as a question-answer interaction in the familiar instruction-following format.
Small but stable up to 512,000 tokens
With this recipe and a fairly modest training budget, MMProLong beats several much larger open models like InternVL3-38B and Gemma3-27B. The model was trained with contexts of only up to 128,000 tokens but stays stable at input lengths of 256,000 and even 512,000 tokens, while the original model degrades sharply at those ranges.
On the Needle-in-a-Haystack benchmark for long multimodal contexts, MMProLong gains an average of 29.4 points over the Qwen2.5-VL-7B base. | Image: ByteDance
This ability also transfers to tasks the model was never specifically trained on, like understanding long videos. In an extra transfer experiment, the recipe proved effective on the stronger Qwen3-VL-8B too, even though that model is already built for long contexts.
Even though it was trained only on documents, the gains carry over to long-video benchmarks. | Image: ByteDance
The study is also interesting because it comes from an entirely different camp than DeepSeek's widely discussed work on the same problem. DeepSeek tries to extend the long memory of AI models by processing texts as images and compressing them heavily, most recently with an encoder that re-sorts visual information by content. ByteDance Seed takes the opposite approach: optimize the training data instead of the architecture.