Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction
This paper proposes STHTD-MP, a behavior-induced Mirror-Prox temporal-difference method that replaces the covariance metric with the symmetric part of the behavior-policy Bellman matrix to speed up off-policy prediction. A convergence analysis and exact numerical mean-operator analysis on three benchmarks indicate that it can have a smaller mean contraction factor than GTD2-MP when the behavior-induced metric improves the saddle-point geometry.
Article intelligence
Key points
- STHTD-MP uses behavior-policy transition information to construct a more informative update geometry.
- Rigorous convergence analysis is provided for fixed-policy linear prediction.
- Exact numerical mean-operator analysis on two-state, Random Walk, and Boyan Chain benchmarks supports a smaller mean contraction factor than GTD2-MP.
- Baird's counterexample is identified as a singular boundary case where the strict assumptions fail.
Why it matters
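The practical performance of gradient temporal-difference methods depends strongly on the geometry induced by the auxiliary-variable metric. STHTD-MP changes that metric using behavior-policy transition information, which the analysis shows can yield a smaller mean contraction factor than GTD2-MP.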
Technical impact
May affect the choice of algorithm for off-policy prediction with linear function approximation and the benchmarks used to compare such methods.
[Submitted on 16 May 2026]
Title: Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction
Authors: Xingguo Chen and 5 other authors
Abstract: Gradient temporal-difference methods provide stable off-policy prediction with linear function approximation, but their practical performance is strongly affected by the geometry induced by the auxiliary-variable metric. Existing Mirror-Prox TD methods typically use the feature covariance metric, whereas hybrid TD methods suggest that behavior-policy transition information can provide a more informative update geometry. This paper proposes a behavior-induced Mirror-Prox temporal-difference method, called STHTD-MP, which replaces the covariance metric in the primal-dual saddle-point formulation with the symmetric part of the behavior-policy Bellman matrix. The method keeps a single learning rate for the primal and auxiliary variables and applies a Mirror-Prox prediction-correction step to the resulting hybrid saddle-point operator. We provide a formal convergence analysis for fixed-policy linear prediction under standard stochastic approximation assumptions: the behavior-induced metric is positive definite, the joint mean system is Hurwitz, boundedness follows from a Lyapunov argument, and the stochastic recursion converges by the ODE method. We further derive projected-oracle ergodic gap bounds and an exact mean-operator comparison with GTD2-MP based on the spectral radius of the deterministic Mirror-Prox error matrix. The analysis shows that STHTD-MP can have a smaller mean contraction factor than GTD2-MP when the behavior-induced metric improves the saddle-point geometry. Exact numerical mean-operator analysis on two-state, Random Walk, and Boyan Chain benchmarks supports this condition, while Baird's counterexample is identified as a singular boundary case where the strict assumptions fail.
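The mean-operator comparison described in the abstract can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration, not the paper's formulation or code: the toy 5-state chain, the polynomial features, the policies, the reward vector, and the step-size rule are hypothetical, and the "behavior-policy Bellman matrix" is taken to be A_mu = Phi^T D (I - gamma P_mu) Phi. The sketch uses the standard GTD2-style saddle-point mean system with z = (theta, w) and update direction G z + g, where G = [[0, A^T], [-A, -M]], and a Euclidean Mirror-Prox (extragradient) step, for which one prediction-correction step multiplies the error by I + alpha G + alpha^2 G^2.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a row-stochastic matrix P."""
    vals, vecs = np.linalg.eig(P.T)
    d = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return d / d.sum()

def saddle_matrix(A, M):
    """Mean operator G of the update z <- z + alpha * (G z + g), z = (theta, w)."""
    k = A.shape[0]
    return np.block([[np.zeros((k, k)), A.T], [-A, -M]])

def mirror_prox_error_matrix(G, alpha):
    """Prediction step then correction step maps the error e to (I + aG + a^2 G^2) e."""
    I = np.eye(G.shape[0])
    return I + alpha * G + alpha**2 * (G @ G)

def spectral_radius(T):
    return float(np.max(np.abs(np.linalg.eigvals(T))))

# Toy 5-state chain with reflecting ends (hypothetical, not a paper benchmark).
n, gamma = 5, 0.9
P_left = np.zeros((n, n))
P_right = np.zeros((n, n))
for s in range(n):
    P_left[s, max(s - 1, 0)] = 1.0
    P_right[s, min(s + 1, n - 1)] = 1.0
P_mu = 0.5 * P_left + 0.5 * P_right   # behavior policy: uniform over left/right
P_pi = 0.4 * P_left + 0.6 * P_right   # target policy: mild preference for right

x = np.linspace(0.0, 1.0, n)
Phi = np.stack([np.ones(n), x, x**2], axis=1)  # hypothetical polynomial features
r = np.zeros(n)
r[-1] = 1.0                                    # hypothetical expected reward per state

D = np.diag(stationary_distribution(P_mu))
A = Phi.T @ D @ (np.eye(n) - gamma * P_pi) @ Phi     # importance-weighted Bellman matrix
b = Phi.T @ D @ r
C = Phi.T @ D @ Phi                                  # feature covariance metric
A_mu = Phi.T @ D @ (np.eye(n) - gamma * P_mu) @ Phi  # assumed behavior-policy Bellman matrix
M = 0.5 * (A_mu + A_mu.T)                            # its symmetric part

G_cov, G_beh = saddle_matrix(A, C), saddle_matrix(A, M)
# One shared step size below 1/L for both operators (a conservative assumption).
alpha = 0.5 / max(np.linalg.norm(G_cov, 2), np.linalg.norm(G_beh, 2))
rho_cov = spectral_radius(mirror_prox_error_matrix(G_cov, alpha))
rho_beh = spectral_radius(mirror_prox_error_matrix(G_beh, alpha))
print(f"covariance metric:       rho = {rho_cov:.6f}")
print(f"behavior-induced metric: rho = {rho_beh:.6f}")
```

Both choices of metric leave the fixed point unchanged (w = 0 and A theta = b), so only the contraction factor differs. Which metric gives the smaller spectral radius depends on the problem and the step size; this toy setup is not meant to reproduce the paper's reported comparison.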
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2605.28849 [cs.AI]
(or arXiv:2605.28849v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2605.28849
arXiv-issued DOI via DataCite
Submission history
From: Xingguo Chen
[v1] Sat, 16 May 2026 11:33:44 UTC (4,782 KB)