Why Video Agent models are next: Ethan He, xAI Grok Imagine

Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and why Grok Imagine is so underrated. For the first time, we do a deep dive with the guy who led it!

Today’s guest Ethan first joined us for the LS Paper Club as the lead on the NVIDIA Cosmos World Model, then joined xAI and built Grok Imagine in 3 months. He comes back on Latent Space with some nuclear hot takes: that video models primarily get their intelligence from LLMs, not from training on video data, and that the next frontier for truly interactive, realtime, long-horizon world models is to work on LLMs (perhaps Interaction Models as well…).

Put it this way: in the near term, the next Sora won’t be a better video model, but a video agent. Generative media may more closely follow the evolution of AI coding, which went from focusing on one-shot output performance and cost to multiturn reasoning and planning models for agents and systems that can plan, edit, test, debug, and submit PRs. At a certain point, coding models got so good that the only significant next step to improve performance was handling the orchestration of these models. Now, as video models improve significantly in realism, consistency, and prompt adherence while becoming more cost efficient, the next evolution of video generation may also be systems that can plan, generate, edit, critique, and iterate across an entire creative task.

In this episode, Ethan joins swyx and Vibhu to unpack what it actually takes to build frontier image and video systems: data, VAEs, diffusion transformers, audio-video alignment, inference speedups, and the hidden cost of storing and moving massive video datasets. From building NVIDIA’s Cosmos world model to joining xAI as Grok Imagine was being built from zero to one, Ethan He has been at the center of some of the most important work in video generation, multimodal models, and real-time world models.
We go deep on Grok Imagine: how a small xAI team shipped its first multimodal video model in three months, why iteration speed matters more than almost anything in model development, and why many of the biggest gains come from fixing tiny bugs in data and training pipelines.

Flipbook: The future of Videomaxxing

Video agents are almost a sure bet to be the trend in the coming year. We end with a glance at what’s beyond video agents: Flipbook caused a minor sensation this year when it was released, but most treat it as a fun demo. Ethan takes it very seriously: with inference getting faster and cheaper every year, the future of custom video JIT UI is closer than you think.

We talked about why videogen models may become the front end of AI, how generative UI could replace traditional HTML/CSS, why world models need to be real-time, interactive, and long-horizon, and why the future of video generation may depend more on language models and agents than on diffusion alone.

We discuss:

- Why fast iteration mattered more than meetings
- Why small training bugs can drive huge model quality gains
- Why coding models may make compute the bottleneck again
- How image and video models are trained with synthetic captions
- The role of VAEs and latent space in frontier video models
- Why image models are the foundation for video models
- The tradeoff between temporal compression and real-time interactivity
- Flipbook, Neural OS, and the future of generative UI
- Why future interfaces may go from user intent to pixels
- The hidden cost of training video models: storage, egress, and GPU hours
- How step distillation and consistency models (like OpenAI sCM) make video inference orders of magnitude faster
- Grok Imagine 0.9 and large-scale audio-video generation
- Why audio-video alignment is harder than text-video alignment
- Ethan’s definition of world models
- Reference-to-video, video extension, and long-context video generation
- Why xAI’s research communication undersells Grok Imagine
- How xAI culture shaped the speed of development
- AI watermarking, SynthID, and detecting generated media
- Why prompt rewriting matters for video models
- Grok Imagine Agent and the rise of video agents
- Why language models may unlock better video generation
- Robotics, physical AI, and embodied world models
- Why Ethan left xAI and shifted focus toward LLMs
- Self-managed context, memory, and the next frontier for language models

Ethan He
LinkedIn: https://www.linkedin.com/in/ethanhe42
X: https://x.com/EthanHe_42

Timestamps

00:00:00 Introduction
00:01:25 From NVIDIA Cosmos to xAI
00:03:24 Building Grok Imagine from Zero to One
00:10:07 How Image and Video Models Are Trained
00:18:53 Video Compression, VAEs, and Real-Time Tradeoffs
00:22:10 Generative UI, Flipbook, and Neural OS
00:32:10 The Cost of Training Large Video Models
00:37:04 Distillation, GANs, and Fast Video Inference
00:41:21 Audio-Video Generation and Grok Imagine 0.9
00:48:34 What Makes a World Model?
00:55:51 Reference Videos, Long Context, and Video Memory
01:00:11 xAI Culture, Research, and First-Principles Building
01:09:45 AI Safety, Watermarking, and Prompt Rewriting
01:13:10 Video Agents and AI-Assisted Creation
01:27:32 Why Language Models Unlock Better Video
01:31:15 Robotics, Physical AI, and Embodied World Models
01:32:38 Why Ethan Left xAI
01:34:16 Self-Managed Context and the Future of LLMs
01:38:43 Ethan’s Career Path and Closing Thoughts

Transcript

Introduction: Ethan He, Latent Space, and the Path to xAI

Swyx [00:00:00]: We’re here in the studio with Ethan He, most recently of xAI. Welcome.
Ethan [00:00:10]: Thank you. Glad to be here.
Swyx [00:00:11]: We’re also here with Vibhu. You first came to us, or joined the Latent Space world, because you were working on Cosmos at NVIDIA, and you did a paper. We loved it. You presented it as well, so thank you for doing that.
Ethan [00:00:23]: Actually, I also presented on MoEs twice at Latent Space.
Swyx [00:00:29]: How did you actually hear about us?
Did we reach out to you? Is that how it worked?
Ethan [00:00:33]: No, actually, I-- the community. I realized, oh, there is this online community where people talk about AI and also learn from each other through papers every week at the Paper Club. It’s very nice.
Ethan [00:00:49]: I learned a lot.
Swyx [00:00:49]: I think three years nonstop. We haven’t stopped even on Christmas and New Year’s. Many weeks I want to stop, but it keeps going.
Vibhu [00:00:58]: No, that was good. I think you had posted that you worked on a paper, and I was like, “Oh, very cool. We have Paper Club. Present then.”
Vibhu [00:01:04]: But I might have reached out to you after.
Swyx [00:01:05]: Because it’s an amateur club, right?
Swyx [00:01:08]: So it’s very unusual, but sometimes we have paper authors come by and actually explain the paper. Today we just did the Poolside paper, which was apparently very good.
Vibhu [00:01:18]: Came out yesterday.
Vibhu [00:01:19]: Pretty interesting, right? Fully open. They talk about everything, systems. So it’s a good one. We’ll recommend people read it.
Swyx [00:01:25]: Bring us up to speed on your transition to xAI, ’cause I actually don’t even know when you joined. Just tell the story of the transition.

From NVIDIA Cosmos to xAI: Scaling Video and World Models

Ethan [00:01:34]: Before xAI, I was working on the Cosmos world model at NVIDIA. Cosmos is a giant video foundation model that aims to simulate the world, and it serves as a foundation for all of the roboticists to build on top of. Once I built Cosmos, I realized that this thing also has a scaling law similar to language models, so we need to scale up the video models further. That’s why I realized I needed to move somewhere with much more compute resources. That’s how I--
Swyx [00:02:13]: Than NVIDIA?
Vibhu [00:02:14]: The GPU rich came themselves.
Vibhu [00:02:19]: And timeline-wise, when was Cosmos?
It was pretty early, right? It was an open world model, open paper, everything.
Ethan [00:02:25]: It was the end of 2024.
Vibhu [00:02:28]: End of 2024.
Ethan [00:02:30]: Then in mid-2025, I moved to xAI. I joined about the time when xAI was about to build video models and multimodal models. There was no infra, no data, and no model, and with just a few engineers, we built it in three months and released the first model, Grok Imagine 0.9.
Ethan [00:02:55]: Since then, I kept working on video models and moved more from training to post-training of the video models. For example, reference-to-video, kind of like the cameo feature, and video extension. And before I left, I worked on a world model, leading a small team focused on real-time, long-horizon video generation.

Building Grok Imagine From Scratch in Three Months

Swyx [00:03:24]: Can you give a rough roadmap? Okay, you’re on a brand-new team. Grok previously was only text, or they partnered with BFL for their image gen stuff. What are the building blocks, right? You have compute, data you can procure somewhere. What is the sequence of things that people should think about when setting up a new team?
Vibhu [00:03:43]: Actually even deeper, not just data you can procure. You guys had to go through getting the data too, right? So you shipped it pretty fast, but yeah--
Swyx [00:03:51]: Three months is like--
Vibhu [00:03:52]: From everything--
Swyx [00:03:52]: Actually, like, very surprisingly fast.
Ethan [00:03:55]: One thing I’d say is thanks to my experience at NVIDIA, ’cause the first time, when we were building Cosmos together, we built it over about a year. So this is the second time I’ve done it. I roughly had an idea of what to do. I’d say the most important thing is the talent. Everyone was very strong and clever, working very closely with each other towards a common goal. So that speeds things up a lot.
So you reduce the communication bandwidth among people, and everyone can work towards the same goal. Every day there aren’t that many meetings on the calendar, maybe a sync a day, and after that it’s just all building. It was pretty fun at that time.
Ethan [00:04:47]: Another thing is that xAI has very strong foundations in data inference and model inference, and the support there helps model development a lot. When I look at training models, I think the most important thing is how many iterations you can do per day. The more iterations you can do, the faster you can train the model. So if you have very strong infra and a lot of compute, you can train these models in a very short period of time. That gives you a much larger buffer for errors, and it also gives you the opportunity to spot more bugs.

Iteration Speed, Compute, and Debugging Model Pipelines

Swyx [00:05:46]: What is an iteration? Is it like a few hundred steps, or what are you--
Ethan [00:05:50]: Let’s say just training the model: acquiring new data, maybe designing new algorithms, and training a new model, maybe at smaller scale or--
Swyx [00:06:01]: So cycle time for any hyperparam that you’re searching.
Ethan [00:06:04]: Cycle time, and also to eval this model: is this model better than my previous iteration?
Ethan [00:06:11]: So--
Swyx [00:06:11]: So before you, someone had already set this up so that you can iterate very quickly.
Ethan [00:06:15]: I think the foundation there is extremely good for developing and researching models.
Ethan [00:06:23]: And often I find, and this is kind of boring, that a lot of the improvements do not come from new algorithms. They come from finding small bugs here and there in the data pipeline and in the model training pipeline. Those give the biggest boost to model quality.
Vibhu [00:06:46]: It’s interesting, right? So you say it’s like a small team, less communication