2026-04-29 00:00 UTC | Updated: 2026-06-27 00:25 UTC

Adaptive Thinking: Large Language Models Know When to Think in Latent Space

Apple Machine Learning Research introduces Sonata, a lightweight adapter that uses self-consistency prediction to dynamically allocate thinking budgets during inference, reducing thinking tokens by 20-80% while maintaining accuracy, or improving accuracy by up to 5% with the same token cost.

Source: Apple Machine Learning Research

Research area: Methods and Algorithms
Conference: ICLR
Content type: paper
Published: April 2026

Adaptive Thinking: Large Language Models Know When to Think in Latent Space

Authors: Pingzhi Li†‡, Bairu Hou, Yun Zhu†, Yihao Feng, Ke Ye†, Tao Lei, Zhifeng Chen, Tianlong Chen‡, Xianzhi Du

Recent advances in test-time computing for large language models (LLMs) have introduced the capability to perform intermediate chain-of-thought (CoT) reasoning (thinking) before generating answers. While increasing the thinking budget yields smooth performance improvements at inference time, the relationship between LLM capability, query complexity, and optimal budget allocation remains poorly understood for achieving compute-optimal inference. To address this challenge, we utilize self-consistency, the agreement among multiple reasoning paths, as a proxy for thinking necessity. We first identify that lower self-consistency indicates when queries require extended thinking to reach correct answers. Building on this insight, we introduce Sonata (Self-Consistency-Guided Adapter for Thinking Allocation), a lightweight approach that adaptively allocates thinking budgets to optimize the performance-efficiency tradeoff. Sonata includes an adapter trained offline on a calibration dataset to predict self-consistency directly from last-layer hidden representations during the query prefilling stage. This prediction then guides on-the-fly budget allocation before thinking. The adapter is general, transferable across diverse tasks once trained, and introduces almost zero computational overhead during inference. Notably, Sonata is orthogonal to existing CoT compression methods, enabling further efficiency gains when managing thinking budgets across queries. Extensive experiments on multiple models (Qwen3-8B, GPT-OSS-120B, Qwen3-235B-A22B, Intern-S1-mini) and benchmarks (AIME24, AIME25, GSM8K, MATH500, GPQA) demonstrate that Sonata achieves a 20% to 80% reduction in thinking tokens while maintaining the same accuracy, or up to a 5% improvement in accuracy with the same token cost.
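The pipeline described in the abstract (measure self-consistency, predict it from a hidden state, then pick a thinking budget) can be illustrated with a minimal Python sketch. Everything beyond the definition of self-consistency as agreement among sampled answers is an assumption made for illustration: the abstract does not specify the adapter architecture or the allocation rule, so the linear probe `predict_consistency`, the linear mapping in `allocate_budget`, and the token bounds `min_tokens` and `max_tokens` are hypothetical and should not be read as the authors' method.

```python
import math
from collections import Counter


def self_consistency(answers):
    """Fraction of sampled final answers that agree with the majority answer.

    This is the measured quantity; per the abstract, it would serve as the
    training target for the adapter on a calibration dataset.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def predict_consistency(hidden, weights, bias=0.0):
    """Hypothetical adapter: a linear probe with a sigmoid output.

    `hidden` stands in for the last-layer hidden representation of the query
    at prefill time. The real adapter architecture is not given in the source.
    """
    if len(hidden) != len(weights):
        raise ValueError("hidden and weights must have the same length")
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def allocate_budget(consistency, min_tokens=256, max_tokens=8192):
    """Hypothetical rule: lower consistency maps linearly to a larger budget."""
    c = min(max(consistency, 0.0), 1.0)  # clamp to [0, 1]
    return round(max_tokens - c * (max_tokens - min_tokens))


if __name__ == "__main__":
    # Four sampled reasoning paths for one query, three of which agree.
    sc = self_consistency(["42", "42", "17", "42"])
    print(sc, allocate_budget(sc))
```

In this sketch a query whose sampled answers mostly agree receives a budget near `min_tokens`, while a query with scattered answers receives a budget near `max_tokens`. At inference time the measured value would be replaced by the adapter's prediction, so no extra sampling is needed before thinking begins.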

†Work done while at Apple

‡The University of North Carolina at Chapel Hill