2026-05-25 Original text

The TIME Machine: On The Power of Motion for Efficient Perception

The paper proposes using motion as the central modality for video representation, training a masked autoencoder on point-tracks in a self-supervised manner. The resulting TIME embedding, trained solely on synthetic motion data, achieves zero-shot performance on par with state-of-the-art models while using up to 4 orders of magnitude less training data.

Article intelligence

Engineers, Advanced

Key points

Uses point-tracks to represent motion and trains a masked autoencoder to reconstruct missing tracks.
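Self-supervised training on motion bypasses language-dependent supervision; because motion is appearance-independent, it needs fewer examples to generalize.
TIME embedding, trained exclusively on synthetic motion data, is on par with SOTA models in zero-shot evaluation while using up to 4 orders of magnitude (a factor of up to 10,000) less training data.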
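The masking step in the first key point can be illustrated with a toy sketch. This is not the authors' pipeline: the track layout (a list of per-frame `(x, y)` positions), the constant-velocity toy generator, the 0.75 mask ratio, and the helper names `make_synthetic_tracks` and `split_masked_tracks` are all assumptions made for illustration.

```python
import random

def make_synthetic_tracks(num_tracks, num_frames, seed=0):
    """Toy point-tracks: each track is a list of (x, y) positions, one per frame."""
    rng = random.Random(seed)
    tracks = []
    for _ in range(num_tracks):
        x, y = rng.random(), rng.random()  # start position in [0, 1)
        vx, vy = rng.uniform(-0.02, 0.02), rng.uniform(-0.02, 0.02)  # constant velocity
        tracks.append([(x + vx * t, y + vy * t) for t in range(num_frames)])
    return tracks

def split_masked_tracks(num_tracks, mask_ratio, seed=0):
    """Randomly pick whole tracks to hide; return (visible, masked) index lists."""
    if num_tracks < 2:
        raise ValueError("need at least two tracks to keep one visible and mask one")
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    indices = list(range(num_tracks))
    rng.shuffle(indices)
    # Keep at least one visible track and at least one masked track.
    num_masked = max(1, min(num_tracks - 1, round(num_tracks * mask_ratio)))
    return sorted(indices[num_masked:]), sorted(indices[:num_masked])

tracks = make_synthetic_tracks(num_tracks=8, num_frames=16)
visible_idx, masked_idx = split_masked_tracks(len(tracks), mask_ratio=0.75)
encoder_input = [tracks[i] for i in visible_idx]  # what an encoder would see
targets = [tracks[i] for i in masked_idx]         # what a decoder would reconstruct
print(len(encoder_input), "visible tracks,", len(targets), "masked tracks")
```

Masking whole tracks, rather than individual frames within a track, is one plausible reading of "reconstruct missing tracks"; the summary does not specify the masking granularity.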

Why it matters

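This matters because motion represented as point-tracks is appearance-independent, so a model needs far fewer examples to generalize, and self-supervised track reconstruction avoids restricting the learned concepts to those that appear in captions.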

Technical impact

If the results hold up, they may affect training-data budgets, model selection for temporal-understanding tasks, and zero-shot evaluation practice for video models.

[Submitted on 21 May 2026]

Authors: Mantas Skackauskas and 2 other authors

Abstract: Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have pushed the boundaries of what video models can do, they also introduce their own set of limitations: first, scaling video models can reach prohibitive costs, and second, learning from language restricts the range of concepts that can be learned to those in captions. As a result, video models still struggle with temporal understanding. In this paper we propose a novel approach that uses motion as the central modality for video representation. In particular, given the motion in a video in the form of point-tracks, we use a masked autoencoder to mask some of the tracks and train the autoencoder to reconstruct the missing tracks. This allows us to learn a representation in a self-supervised manner. We show that using motion to represent videos actually addresses both of the core limitations of video technology. First, it allows us to massively reduce the scale of training data, as motion is inherently appearance-independent and hence needs fewer examples to generalize well. Second, motion allows us to bypass the language-dependent training paradigm, learning better fine-grained concepts. The result is an embedding that we call TIME (Temporally Informed Motion Embedding), a representation trained exclusively on synthetic motion data. We test this embedding on a wide set of tasks in a zero-shot manner. We observe that without bells and whistles, performance is on par with state-of-the-art models using up to 4 orders of magnitude less training data. This is a stepping stone towards a new paradigm of video models that are both more temporally aware and more scalable.
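The training objective described in the abstract, reconstructing the masked tracks, can be sketched as a loss computed only over the hidden tracks. The abstract does not give the loss the authors use; the mean squared point error below, the function name `masked_reconstruction_loss`, and the mean-of-visible-tracks stand-in for a decoder are assumptions for illustration only.

```python
def masked_reconstruction_loss(predicted, target, masked_idx):
    """Mean squared Euclidean distance per point, over masked tracks only."""
    if not masked_idx:
        raise ValueError("masked_idx must not be empty")
    total, count = 0.0, 0
    for i in masked_idx:
        for (px, py), (tx, ty) in zip(predicted[i], target[i]):
            total += (px - tx) ** 2 + (py - ty) ** 2
            count += 1
    return total / count

# Toy data: 3 tracks over 2 frames; track 2 is hidden from the model.
target = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0)],
    [(0.0, 2.0), (1.0, 2.0)],
]
masked_idx = [2]

# Stand-in "decoder" (hypothetical): predict each masked track as the
# per-frame mean of the visible tracks. A trained model would replace this.
visible = [t for i, t in enumerate(target) if i not in masked_idx]
mean_track = [
    (sum(p[0] for p in frame) / len(frame), sum(p[1] for p in frame) / len(frame))
    for frame in zip(*visible)
]
predicted = [mean_track if i in masked_idx else t for i, t in enumerate(target)]

loss = masked_reconstruction_loss(predicted, target, masked_idx)
print(loss)  # 2.25: the mean track sits 1.5 units below the masked track in y
```

Restricting the loss to masked tracks follows the usual masked-autoencoder convention, where visible inputs do not contribute to the reconstruction error.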

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as: arXiv:2605.23045 [cs.CV]

(or arXiv:2605.23045v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2605.23045

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Mantas Skackauskas
[v1] Thu, 21 May 2026 21:22:42 UTC (2,617 KB)
