2026-06-16 | Updated: 2026-06-16

Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts

This paper formalizes embedding model routing as an adversarial contextual linear bandit with low-rank experts, introduces a log-quadratic policy class that admits efficient online learning, and proposes the Hypentropy Policy Gradient (HPG) algorithm, which attains $\tilde{\mathcal O}(s\sqrt{MT})$ linearized policy regret and thereby avoids a curse of dimensionality.

Source: arXiv Machine Learning | Authors: Yan Dai, Negin Golrezaei, Patrick Jaillet

[Submitted on 12 Jun 2026]

Abstract: Modern recommendation systems increasingly rely on dynamically routing diverse queries to multiple embedding models. Despite its practical significance, this problem remains poorly understood under realistic conditions such as adversarial queries, bandit feedback, and limited observability of models. We formalize embedding model routing as an adversarial contextual linear bandit with low-rank experts, where contexts are queries, actions are items, and experts are the embedding models working on low-rank latent representation spaces. First, we establish that standard regret notions suffer from structural misspecification or statistical intractability, and we identify a log-quadratic policy class that is expressive enough to capture query-dependent model routing, yet structured enough to allow efficient online learning. Second, we propose a policy gradient algorithm called Hypentropy Policy Gradient (HPG). It provably adapts to the unknown low-rank structure under incomplete information and attains $\tilde{\mathcal O}(s\sqrt{MT})$ linearized policy regret, where $s$, $M$, and $T$ are the intrinsic rank of the experts, the number of models, and the number of rounds, respectively, thus avoiding a curse of dimensionality. Finally, we provide a computationally efficient and parameter-free implementation of HPG.
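To make the setting concrete, here is a minimal Python sketch of query-dependent routing over a few embedding models. It is an illustration under stated assumptions, not the paper's HPG algorithm or its exact policy class: the abstract does not define "log-quadratic", so the sketch assumes one plausible reading in which the log-probability of routing to model m is, up to normalization, the quadratic form ||U_m x||^2 for a rank-s factor U_m. The names `routing_probs`, `bound_scale`, and `factors`, and the sizes `d`, `s`, `M`, are hypothetical.

```python
import math
import random

def matvec(U, x):
    """Multiply an (s x d) matrix, stored as a list of rows, by a length-d vector."""
    return [sum(u_ij * x_j for u_ij, x_j in zip(row, x)) for row in U]

def routing_probs(factors, x):
    """Softmax over models of the quadratic score ||U_m x||^2 = x^T (U_m^T U_m) x.

    Assumption: this is one possible 'log-quadratic' policy, chosen for illustration.
    """
    scores = [sum(z * z for z in matvec(U, x)) for U in factors]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(sc - top) for sc in scores]
    total = sum(weights)
    return [w / total for w in weights]

def bound_scale(s, M, T):
    """The s * sqrt(M * T) scale from the abstract, with log factors and constants dropped."""
    return s * math.sqrt(M * T)

rng = random.Random(0)
d, s, M = 8, 2, 3  # hypothetical query dimension, expert rank, number of models

# One random rank-s factor per embedding model (hypothetical parameters).
factors = [
    [[rng.gauss(0, 1) / math.sqrt(d) for _ in range(d)] for _ in range(s)]
    for _ in range(M)
]

# One round of the interaction: a query arrives and a model is sampled from the policy.
query = [rng.gauss(0, 1) for _ in range(d)]
probs = routing_probs(factors, query)
chosen = rng.choices(range(M), weights=probs)[0]

# Per-round average of the bound scale shrinks as T grows, which is what "sublinear" means here.
avg = [bound_scale(s, M, T) / T for T in (10**2, 10**4, 10**6)]
```

The bound scale depends on the rank s and the number of models M but not on the ambient query dimension d, which is the sense in which the abstract speaks of avoiding a curse of dimensionality.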

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Cite as: arXiv:2606.14929 [cs.LG] (or arXiv:2606.14929v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2606.14929

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yan Dai. [v1] Fri, 12 Jun 2026 20:09:03 UTC (75 KB)
