2026-05-27 09:01 UTC | Updated: 2026-06-30 13:03 UTC

Peking University, CUHK, and Shanghai AI Lab Develop VGGT-Edit: 3D Scene Editing in 5 Seconds with 120x Speedup

Researchers from Peking University, The Chinese University of Hong Kong, Shanghai AI Lab, and Nanyang Technological University (NTU) have introduced VGGT-Edit, a native 3D editing framework that performs a scene edit in approximately 5 seconds, up to 120x faster than traditional optimization-based methods. On the team's DeltaScene benchmark, it outperforms existing approaches in semantic consistency, multi-view stability, and inference speed.

Source: 量子位 | Author: 听雨

Key points

VGGT-Edit is the first native 3D editing framework that operates directly in 3D space, eliminating multi-view inconsistencies caused by 2D approaches.
Residual field prediction lets the model predict only the local changes while keeping the background stable, which makes edits both fast and high quality.
Depth-synchronized text injection continuously aligns text semantics with 3D spatial features, improving editing accuracy.
A new dataset, DeltaScene, with nearly 100k samples, was created to train and evaluate 3D editing models.

Recent advances in 3D reconstruction, from NeRF and 3D Gaussian Splatting to feed-forward models such as VGGT and π³, have made it possible to reconstruct entire 3D scenes from just a few images, and with the feed-forward models this takes only seconds. However, a critical capability has remained elusive: editing those 3D scenes. While current models can "see" the 3D world, they struggle to "modify" it. For instance, you can reconstruct a room but cannot easily instruct the system to move a chair to the window, delete a specific chair, or change a gray leather sofa to a white fluffy one. Existing methods often suffer from inconsistencies across viewpoints, with objects disappearing and reappearing or backgrounds deforming unexpectedly.

To address this challenge, a research team from Peking University, The Chinese University of Hong Kong, Shanghai AI Lab, and Nanyang Technological University (NTU) has proposed VGGT-Edit, a native 3D editing framework. The core idea is straightforward: perform edits directly in 3D space instead of falling back to 2D. The framework builds upon VGGT-like feed-forward reconstruction models, inheriting their fast and efficient 3D representation. Instead of regenerating the entire scene, VGGT-Edit introduces a mechanism called residual field prediction. The model retains the original stable 3D structure and only learns where changes are needed, such as moving a chair, altering a material, deleting an object, or adding furniture. The new scene is computed as the original scene plus local residual changes, so untouched background regions remain highly stable.
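The "original scene plus local residual" idea can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the point-array scene representation, the explicit `edit_mask`, and the function name `apply_residual_edit` are all assumptions made for illustration. The article only states that the edited scene is the original plus local residual changes.

```python
import numpy as np

def apply_residual_edit(base_points, residual, edit_mask):
    """Return base_points + residual on masked points; other points are untouched.

    base_points: (N, 3) reconstructed 3D points (assumed representation).
    residual:    (N, 3) predicted per-point offsets (hypothetical network output).
    edit_mask:   (N,) values in [0, 1]; 1 marks points that belong to the edit.
    """
    base_points = np.asarray(base_points, dtype=float)
    residual = np.asarray(residual, dtype=float)
    gate = np.asarray(edit_mask, dtype=float)[:, None]  # (N, 1), broadcasts over xyz
    return base_points + gate * residual

# Toy scene: the first two points are background, the last two belong to a chair.
base = np.array([[0.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [2.0, 0.0, 1.0],
                 [2.0, 1.0, 1.0]])
residual = np.full((4, 3), 0.5)   # pretend prediction, nonzero everywhere
mask = np.array([0, 0, 1, 1])     # only the chair may change

edited = apply_residual_edit(base, residual, mask)
```

Because the background points are gated to zero residual, they come out bit-for-bit identical to the input, which is the stability property the residual formulation is meant to provide.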

A key innovation is depth-synchronized text injection. Simply feeding in a text instruction often fails because the model knows what to modify but not where. VGGT-Edit continuously fuses text semantics with 3D spatial features across multiple depth layers, so that the model keeps track of which region to edit, what the target change is, and where it sits in space. Additionally, a view importance weighting mechanism automatically identifies which viewpoints are more reliable, leading to more consistent multi-view results.
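One way to picture these two mechanisms is the NumPy sketch below. It reads "depth layers" as successive network layers, uses plain cross-attention without learned projections for the text fusion, and uses a softmax over per-view scores for the view weighting. All of these choices, and every function name, are assumptions for illustration rather than the architecture described in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inject_text(spatial, text):
    """One cross-attention step: spatial tokens (N, D) attend to text tokens (T, D)."""
    d = spatial.shape[-1]
    attn = softmax(spatial @ text.T / np.sqrt(d), axis=-1)  # (N, T)
    return spatial + attn @ text                             # residual connection

def run_layers(spatial, text, layer_weights):
    """Re-inject the text at every layer instead of only once at the input."""
    for w in layer_weights:
        spatial = np.tanh(spatial @ w)        # stand-in for a backbone layer
        spatial = inject_text(spatial, text)  # text fused again at this depth
    return spatial

def fuse_views(per_view_feats, view_scores):
    """Combine (V, N, D) per-view features using softmax-normalized view scores."""
    w = softmax(np.asarray(view_scores, dtype=float))  # (V,)
    return np.tensordot(w, per_view_feats, axes=1)     # (N, D)

rng = np.random.default_rng(0)
spatial = rng.normal(size=(5, 8))   # 5 spatial tokens, 8 channels
text = rng.normal(size=(3, 8))      # 3 text tokens, 8 channels
weights = [rng.normal(size=(8, 8)) * 0.1 for _ in range(4)]

out = run_layers(spatial, text, weights)
views = np.stack([out, out + 1.0])      # two toy views of the same tokens
fused = fuse_views(views, [0.0, 0.0])   # equal scores give a plain average
```

With equal scores the fusion reduces to an average of the views; raising one view's score shifts the result toward that view, which is the behavior a reliability weighting would need.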

The framework also includes a specialized editing head tailored for 3D editing tasks. While the original reconstruction head focuses on recovering the scene, the editing branch predicts local changes directly in the 3D representation space. The editing branch learns which areas should remain unchanged, which need editing, and how to maintain multi-view consistency after editing. Compared to regenerating the entire scene, this approach is more stable and efficient.

To train VGGT-Edit, the team created a new dataset called DeltaScene, containing nearly 100,000 samples covering various scenes like living rooms, offices, and commercial spaces. The data generation pipeline is highly automated, leveraging tools like Qwen3.5-Plus, SAM3, and Qwen-Image-Editing-Max to generate editing instructions, identify targets, perform multi-view editing, and filter for 3D consistency. This ensures the model learns not just image changes but how edits remain spatially consistent across different viewpoints.

Results on the DeltaScene benchmark show VGGT-Edit outperforms existing methods in semantic consistency, multi-view stability, and inference speed. It completes a single edit in about 5 seconds, achieving up to 120x acceleration over traditional optimization-based methods. This brings 3D editing close to real-time interaction, which is crucial for robotics, digital twins, and AR/VR applications.

An interesting experiment involved an unseen instruction: "Rotate the middle chair 90 degrees clockwise." The model successfully executed the edit, demonstrating that VGGT-Edit learns to map text semantics to 3D spatial changes beyond fixed templates. This ability to understand and modify the 3D world in a flexible, stable, and real-time manner marks a significant step toward interactive spatial intelligence.
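To make concrete what such an instruction means geometrically, the sketch below rotates a toy "chair" 90 degrees clockwise about its own centroid and expresses the result as a per-point offset. The z-up, right-handed axis convention, the "clockwise as seen from above" reading, and the function name are assumptions; the article does not describe how the model represents the rotation internally.

```python
import numpy as np

def rotate_clockwise_about_up(points, degrees):
    """Rotate points clockwise (seen from above, z up) about their centroid."""
    points = np.asarray(points, dtype=float)
    theta = -np.deg2rad(degrees)  # clockwise from above is a negative angle about +z
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    centroid = points.mean(axis=0)
    return (points - centroid) @ rot.T + centroid

# Toy chair: two points on the floor and one point above them.
chair = np.array([[1.0, 0.0, 0.0],
                  [-1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])

rotated = rotate_clockwise_about_up(chair, 90)
offsets = rotated - chair  # the edit written as per-point 3D offsets
```

The point at (1, 0, 0) moves to (0, -1, 0), heights are unchanged, and the centroid stays in place, so the edit is fully described by offsets on the chair's own points.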

Paper: https://arxiv.org/abs/2605.15186