2026-05-19 · Updated: 2026-06-12

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

Recent research shows Universal Multimodal Embedding (UME) benefits from Chain-of-Thought (CoT) reasoning, but explicit CoT traces are computationally expensive. This paper proposes replacing explicit CoT with latent think tokens, which serve as latent variables that can produce explicit CoT traces as observed variables. By optimizing think tokens using CoT generation loss and embedding tokens using contrastive loss, the model achieves high-performance, reasoning-aware representations at constant inference cost. The introduced TTE-Flash-2B model outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark, with interpretable think tokens. Zero-shot evaluation across 15 video datasets shows scaling behavior with more think tokens, motivating adaptive think budget allocation.

Source: arXiv AI
Authors: Jianpeng Cheng, Xian Wu, Jiangfan Zhang, Wentao Bao, Chaitanya Ahuja, Shlok Kumar Mishra, Hanchao Yu, Yang Gao, Fan Xia, Qi Guo, Shaodan Zhai, Xiangjun Fan, Jun Xiao

[Submitted on 15 May 2026]

Abstract: Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with the final representation extracted from an embedding token attending to both the query and the reasoning. Despite its effectiveness, the computational overhead of generating explicit CoT traces is often prohibitive. In this work, we propose replacing explicit CoT with latent think tokens, which are interpreted as latent variables that can produce explicit CoT traces as observed variables. By optimizing think tokens using CoT generation loss and subsequent embedding tokens using contrastive loss, we produce high-performance, reasoning-aware representations at a constant inference cost. Our study investigates two key architectural designs: 1) how think and embedding tokens should be extracted from the same LLM backbone, and 2) how the tokens should be trained as two dependent tasks. We introduce TTE-Flash-2B, a reasoning-aware multimodal representation model that outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark, while producing latent think tokens that are interpretable both textually and visually. Furthermore, zero-shot evaluation across 15 video datasets reveals scaling behavior as the number of think tokens increases, motivating a pilot study of adaptive think budget allocation based on task requirements.
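
The two-part objective described in the abstract (a CoT generation loss on the think tokens and a contrastive loss on the embedding tokens) can be illustrated with a minimal pure-Python sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the contrastive term is written as InfoNCE with in-batch negatives, the temperature of 0.05 and the `cot_weight` mixing factor are placeholder values, and all function names are hypothetical rather than taken from the paper.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def cot_generation_loss(think_logits, cot_token_ids):
    """Mean token-level cross-entropy for reproducing a CoT trace.

    `think_logits[t]` holds vocabulary logits predicted for position t of
    the trace; `cot_token_ids[t]` is the reference token at that position.
    """
    losses = [cross_entropy(l, t) for l, t in zip(think_logits, cot_token_ids)]
    return sum(losses) / len(losses)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query_embs, target_embs, temperature=0.05):
    """In-batch contrastive loss: query i should match target i.

    The other targets in the batch act as negatives. The temperature
    value is an assumed placeholder, not a setting from the paper.
    """
    total = 0.0
    for i, q in enumerate(query_embs):
        sims = [cosine(q, t) / temperature for t in target_embs]
        total += cross_entropy(sims, i)
    return total / len(query_embs)


def joint_loss(think_logits, cot_token_ids, query_embs, target_embs,
               cot_weight=1.0):
    """Contrastive loss plus weighted CoT generation loss (assumed mix)."""
    return (info_nce(query_embs, target_embs)
            + cot_weight * cot_generation_loss(think_logits, cot_token_ids))


if __name__ == "__main__":
    # Toy batch: two queries, two targets, a 2-token trace over a 4-word vocab.
    queries = [[1.0, 0.0], [0.0, 1.0]]
    targets = [[1.0, 0.0], [0.0, 1.0]]
    logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
    print(joint_loss(logits, [1, 3], queries, targets))
```

In this sketch the two terms are simply summed. How the paper actually couples the two dependent tasks is one of the design questions it studies, and the abstract does not specify it.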

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2605.16638 [cs.AI]

(or arXiv:2605.16638v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2605.16638

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Xian Wu [v1] Fri, 15 May 2026 21:10:56 UTC (42,978 KB)
