2026-07-01 04:00 UTC | Updated: 2026-07-01 08:16 UTC

ViTL: Temporal Logic-Guided Zero-Shot Natural Language Navigation via Vision-Language Models

The ViTL framework uses an LLM to compile natural language commands into Linear Temporal Logic (LTL) formulas, which are converted into Deterministic Finite Automata (DFA) that coordinate multi-channel value maps. It also introduces a directional score for navigation, enabling zero-shot completion of multi-target, temporally constrained navigation tasks. Experiments on HM3D show that the framework completes such tasks and that the directional score improves single-target navigation accuracy and efficiency over the baseline.

Source: arXiv Robotics | Authors: Kaier Liang, Hengde Dai, Cristian-Ioan Vasile


[Submitted on 29 Jun 2026]


Abstract: Enabling robots to follow natural language commands to complete zero-shot long-horizon tasks remains challenging. It requires extracting implicit temporal and logical constraints from natural language commands and executing multiple sub-tasks accordingly. Recent zero-shot object navigation methods use vision-language models (VLMs) to guide frontier-based exploration in unknown environments, but they are limited to single-target tasks. Real-world commands such as "Clean either the chair or the couch, then turn on the tv" require navigating to multiple targets in a temporally constrained order, which no existing zero-shot system can handle. We present ViTL, a framework that addresses this gap at two levels. At the task level, we use a large language model (LLM) to compile natural language commands into Linear Temporal Logic (LTL) formulas, which are then converted into Deterministic Finite Automata (DFA) that coordinate multi-channel value maps and trigger dynamic replanning when new objects are detected. At the navigation level, we introduce a directional score: rather than producing a direction-agnostic value across the entire field of view, we label frontier directions on the observation image and extract per-direction scores from the VLM. Experiments on Habitat-Matterport 3D (HM3D) show that the full framework enables zero-shot long-horizon completion of natural language navigation tasks with temporal constraints, and that the directional score improves single-target navigation accuracy and efficiency over the baseline.
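To make the task-level idea concrete, here is a minimal sketch of how the example command could be tracked by a small automaton. It is not the authors' implementation. It assumes one plausible LTL reading of the command, F((chair ∨ couch) ∧ F tv), and hand-writes the corresponding DFA instead of generating it with an LLM and an LTL-to-automaton tool. The state names, the transition table, and the helpers `step`, `active_targets`, and `run` are all hypothetical, and the sketch assumes the robot reaches at most one target per step.

```python
# Hypothetical DFA for one reading of the example command:
#   F((chair | couch) & F tv)
# "eventually reach the chair or the couch, and afterwards reach the tv".
# All names below are illustrative assumptions, not taken from the paper.

TRANSITIONS = {
    "q0": {"chair": "q1", "couch": "q1"},  # waiting for the first target
    "q1": {"tv": "q2"},                    # first target done, waiting for the tv
    "q2": {},                              # accepting state: task complete
}
ACCEPTING = {"q2"}


def step(state, reached):
    """Advance the DFA given the set of objects reached at this time step."""
    for obj, next_state in TRANSITIONS[state].items():
        if obj in reached:
            return next_state
    return state  # nothing relevant was reached, so the state is unchanged


def active_targets(state):
    """Objects that would advance the task from `state`.

    With one value-map channel per object, these are the channels
    worth following in the current automaton state.
    """
    return sorted(TRANSITIONS[state])


def run(trace):
    """Feed a sequence of 'reached' sets through the DFA; return the final state."""
    state = "q0"
    for reached in trace:
        state = step(state, reached)
    return state


if __name__ == "__main__":
    # Reaching the tv first does not count; the couch must come before it.
    final = run([{"tv"}, {"couch"}, set(), {"tv"}])
    print(final, final in ACCEPTING)
    print(active_targets("q0"), active_targets("q1"))
```

In this sketch the automaton state decides which targets matter at each moment: in `q0` either the chair or the couch advances the task, and only after one of them is reached does the tv become the active target.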

Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as: arXiv:2606.30696 [cs.RO]

(or arXiv:2606.30696v1 [cs.RO] for this version)

https://doi.org/10.48550/arXiv.2606.30696

arXiv-issued DOI via DataCite

Submission history

From: Kaier Liang
[v1] Mon, 29 Jun 2026 02:22:31 UTC (1,446 KB)
