2026-06-25 04:00 UTC | Updated: 2026-06-25 08:00 UTC

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Wan-Streamer is a native-streaming, end-to-end interactive foundation model designed for real-time, low-latency, full-duplex audio-visual interaction. It models language, audio, and video as both input and output within a single Transformer, using block-causal attention for incremental streaming and no external language, speech, avatar, or video-generation modules. It achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency (including 350 ms of bidirectional network latency), enabling sub-second duplex communication.
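The two latency figures in this summary imply a budget for everything outside the model. The sketch below is only illustrative arithmetic on the reported approximate numbers; the function names are hypothetical and nothing here comes from the authors' code.

```python
# Approximate figures reported for Wan-Streamer (both are rounded values).
MODEL_LATENCY_MS = 200   # model-side response latency
TOTAL_LATENCY_MS = 550   # total interaction latency


def remaining_budget_ms(total_ms: int, model_ms: int) -> int:
    """Latency left for everything outside the model (e.g. the network)."""
    if model_ms > total_ms:
        raise ValueError("model latency cannot exceed total latency")
    return total_ms - model_ms


def is_sub_second(total_ms: int) -> bool:
    """True when a full interaction round trip stays under one second."""
    return total_ms < 1000


if __name__ == "__main__":
    print(remaining_budget_ms(TOTAL_LATENCY_MS, MODEL_LATENCY_MS))  # 350
    print(is_sub_second(TOTAL_LATENCY_MS))  # True
```

The 350 ms remainder matches the bidirectional network latency quoted in the paper's abstract.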

Source: arXiv Computer Vision

Authors: Lianghua Huang, Zhifan Wu, Wei Wang, Yupeng Shi, Mengyang Feng, Junjie He, Chenwei Xie, Yu Liu, Jingren Zhou, Ang Wang, Bang Zhang, Baole Ai, Chen Liang, Cheng Yu, Chongyang Zhong, Jinwei Qi, Kai Zhu, Pandeng Li, Peng Zhang, Wenyuan Zhang, Xinhua Cheng, Yitong Huang, Yun Zheng, Zoubin Bi

[Submitted on 23 Jun 2026]

Abstract: We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer, where the sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that rely on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer does not rely on external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
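To make "block-causal attention" concrete, here is a minimal sketch under the common reading of the term: tokens in the same block attend to each other freely, and every block also attends to all earlier blocks but never to later ones. The abstract does not specify the exact mask, so the block layout, the function names, and the choice of one block per streaming unit are assumptions for illustration only. The helper `frames_per_unit` simply checks the stated "160 ms at 25 fps" figure.

```python
from typing import List


def block_causal_mask(block_sizes: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed semantics (not taken from the paper): full attention inside
    a block, plus attention to every earlier block, none to later blocks.
    """
    # Map every token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


def frames_per_unit(unit_ms: int, fps: int) -> int:
    """Whole video frames covered by one streaming unit."""
    return unit_ms * fps // 1000


if __name__ == "__main__":
    # Hypothetical layout: three streaming units of two tokens each.
    for row in block_causal_mask([2, 2, 2]):
        print("".join("x" if allowed else "." for allowed in row))
    # A 160 ms unit at 25 fps spans 4 frames.
    print(frames_per_unit(160, 25))
```

Because each row only ever gains columns as new blocks arrive, keys and values for earlier blocks can be cached and reused, which is what makes this mask shape a natural fit for incremental streaming.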

Comments: Website: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Sound (cs.SD)

Cite as: arXiv:2606.25041 [cs.CV]

(or arXiv:2606.25041v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2606.25041

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Lianghua Huang Dr.

[v1] Tue, 23 Jun 2026 18:01:03 UTC (3,818 KB)
