2026-05-29 04:00 UTC | Updated: 2026-06-30 13:03 UTC

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction

This paper proposes STHTD-MP, a behavior-induced Mirror-Prox temporal-difference method that replaces the feature covariance metric with the symmetric part of the behavior-policy Bellman matrix, with the aim of faster off-policy prediction. A convergence analysis and exact numerical mean-operator analysis on two-state, Random Walk, and Boyan Chain benchmarks indicate that STHTD-MP can have a smaller mean contraction factor than GTD2-MP when the behavior-induced metric improves the saddle-point geometry.

Source: arXiv AI | Authors: Xingguo Chen, Yuchen Shen, Shangdong Yang, Chao Li, Guang Yang, Wenhao Wang

[Submitted on 16 May 2026]

Abstract: Gradient temporal-difference methods provide stable off-policy prediction with linear function approximation, but their practical performance is strongly affected by the geometry induced by the auxiliary-variable metric. Existing Mirror-Prox TD methods typically use the feature covariance metric, whereas hybrid TD methods suggest that behavior-policy transition information can provide a more informative update geometry. This paper proposes a behavior-induced Mirror-Prox temporal-difference method, called STHTD-MP, which replaces the covariance metric in the primal-dual saddle-point formulation with the symmetric part of the behavior-policy Bellman matrix. The method keeps a single learning rate for the primal and auxiliary variables and applies a Mirror-Prox prediction-correction step to the resulting hybrid saddle-point operator. We provide a formal convergence analysis for fixed-policy linear prediction under standard stochastic approximation assumptions: the behavior-induced metric is positive definite, the joint mean system is Hurwitz, boundedness follows from a Lyapunov argument, and the stochastic recursion converges by the ODE method. We further derive projected-oracle ergodic gap bounds and an exact mean-operator comparison with GTD2-MP based on the spectral radius of the deterministic Mirror-Prox error matrix. The analysis shows that STHTD-MP can have a smaller mean contraction factor than GTD2-MP when the behavior-induced metric improves the saddle-point geometry. Exact numerical mean-operator analysis on two-state, Random Walk, and Boyan Chain benchmarks supports this condition, while Baird's counterexample is identified as a singular boundary case where the strict assumptions fail.
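The mean-operator comparison described in the abstract can be illustrated with a small numerical sketch. The code below is not taken from the paper; it is one reading of the abstract under stated assumptions. It assumes the standard GTD2 mean system with joint variable z = (y, theta) and mean field G z + g, where G = [[-M, -A], [A^T, 0]], a Euclidean Mirror-Prox (extragradient) step with a single step size alpha, and a behavior-induced metric M equal to the symmetric part of Phi^T D_mu (I - gamma P_mu) Phi. The 3-state problem, the feature matrix, both transition matrices, and all function names are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic chain (left eigenvector for eigenvalue 1)."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

def bellman_matrix(Phi, d, P, gamma):
    """Phi^T D (I - gamma P) Phi for state weights d and transition matrix P."""
    return Phi.T @ np.diag(d) @ (np.eye(len(d)) - gamma * P) @ Phi

def mirror_prox_error_matrix(A, M, alpha):
    """Mean error matrix of one extragradient step on z = (y, theta).

    Prediction:  z_half = z + alpha * (G z + g)
    Correction:  z_next = z + alpha * (G z_half + g)
    With G z* + g = 0, the error e = z - z* obeys
    e_next = (I + alpha G + alpha^2 G^2) e.
    """
    k = A.shape[0]
    G = np.block([[-M, -A], [A.T, np.zeros((k, k))]])
    return np.eye(2 * k) + alpha * G + alpha**2 * (G @ G)

def spectral_radius(E):
    return float(np.max(np.abs(np.linalg.eigvals(E))))

# Hypothetical 3-state, 2-feature problem (not one of the paper's benchmarks).
gamma = 0.9
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
P_mu = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])  # behavior
P_pi = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])  # target

d_mu = stationary_distribution(P_mu)

# Off-policy Bellman matrix: behavior state weights, target-policy transitions.
A = bellman_matrix(Phi, d_mu, P_pi, gamma)

# Covariance metric (GTD2-style) versus the assumed behavior-induced metric.
C = Phi.T @ np.diag(d_mu) @ Phi
A_mu = bellman_matrix(Phi, d_mu, P_mu, gamma)
M = 0.5 * (A_mu + A_mu.T)

alpha = 0.1
rho_cov = spectral_radius(mirror_prox_error_matrix(A, C, alpha))
rho_beh = spectral_radius(mirror_prox_error_matrix(A, M, alpha))
print(f"covariance metric:       rho = {rho_cov:.6f}")
print(f"behavior-induced metric: rho = {rho_beh:.6f}")
```

A spectral radius below 1 means the deterministic recursion contracts toward the fixed point, and a smaller value means faster mean convergence. Which metric gives the smaller radius depends on the problem and on alpha, consistent with the abstract's statement that the improvement holds when the behavior-induced metric improves the saddle-point geometry.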

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2605.28849 [cs.AI]

(or arXiv:2605.28849v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2605.28849

arXiv-issued DOI via DataCite

Submission history

From: Xingguo Chen. [v1] Sat, 16 May 2026 11:33:44 UTC (4,782 KB)
