2026-06-02 04:00 UTC | Updated: 2026-06-30 13:03 UTC

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

This paper introduces a training method that combines delayed per-step reward attribution, eligibility gating, asynchronous rollout generation via vLLM, curriculum-based opponent sampling, and multi-level stratified batch construction. An 8B-parameter open-source model trained with this approach matched or surpassed larger proprietary systems, including GPT-5, on the NeurIPS 2025 MindGames Arena benchmark, winning first place in both the Open and Efficient tracks.

Source: arXiv AI
Authors: Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov

[2606.00017] MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

[Submitted on 13 Apr 2026]


Abstract: Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to originating steps according to task-specific semantics, and excludes steps that lack valid dependent information from training. Together with asynchronous rollout generation via vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach enables stable, sample-efficient RL training in multi-agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (
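The core idea in the abstract (score steps only once the episode has ended, route each reward back to the step that earned it, and drop steps whose dependent information never appeared) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the Step record, attribute_rewards, and outcome_rule are hypothetical names, and the specific gating rule (exclude rule-violating moves and moves no other player ever responded to) is an assumption chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One agent action recorded during an episode (hypothetical record)."""
    player: int
    action: str
    valid: bool = True               # did the move obey the game rules?
    reward: Optional[float] = None   # stays None until the episode has ended
    eligible: bool = False           # gate: may this step be used for training?


# A task-specific rule maps (step index, all steps, final scores) to a reward,
# or to None when the step has no valid dependent information.
Rule = Callable[[int, List[Step], Dict[int, float]], Optional[float]]


def attribute_rewards(steps: List[Step],
                      final_scores: Dict[int, float],
                      rule: Rule) -> List[Step]:
    """Assign rewards after episode end and return only the eligible steps."""
    for i, step in enumerate(steps):
        r = rule(i, steps, final_scores)
        step.reward = r
        step.eligible = r is not None
    return [s for s in steps if s.eligible]


def outcome_rule(i: int, steps: List[Step],
                 final_scores: Dict[int, float]) -> Optional[float]:
    """Assumed rule: credit a step with its player's final score, but only if
    the move was legal and some other player acted after it."""
    step = steps[i]
    if not step.valid:
        return None  # assumption: rule-violating moves are gated out
    answered = any(s.player != step.player for s in steps[i + 1:])
    if not answered:
        return None  # the dependent future event never materialized
    return final_scores[step.player]


episode = [
    Step(player=0, action="cooperate"),
    Step(player=1, action="defect"),
    Step(player=0, action="illegal-move", valid=False),
    Step(player=1, action="cooperate"),
]
batch = attribute_rewards(episode, {0: 1.0, 1: -1.0}, outcome_rule)
for s in batch:
    print(s.player, s.action, s.reward)
```

Under these assumptions, only the first two steps reach the training batch: the third is gated out as an illegal move, and the fourth because no other player acted after it. A real pipeline would swap in a rule per game, which is one way to read the abstract's "task-specific semantics".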
