Generation Results of Vid2World. We show the generation results of Vid2World in the videos below.
World models, which predict future transitions from past observation and action sequences, have shown great promise for improving data efficiency in sequential decision-making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their usefulness in complex environments. In contrast, video diffusion models trained on large-scale internet data have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present _Vid2World_, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World systematically explores _video diffusion causalization_, reshaping both the architecture and training objective of pre-trained models to enable autoregressive generation. Additionally, it incorporates a _causal action guidance_ mechanism to enhance action controllability in the resulting interactive world models. Extensive experiments across multiple domains, including robot manipulation, 3D game simulation, and open-world navigation, demonstrate that our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.
Vid2World is a general framework for transforming full-sequence, non-causal, passive video diffusion models into autoregressive, interactive, action-conditioned world models. At its core, it represents a paradigm shift in world modeling: from relying solely on expensive and limited action-labeled data to leveraging internet-scale action-free videos.
Transforming a video diffusion model into an interactive world model demands closing two fundamental gaps: (i) Enabling causal generation—standard VDMs denoise whole sequences with bidirectional context, making them unsuitable for causal rollouts where predictions depend only on past information; (ii) Enforcing action conditioning—VDMs are usually conditioned only on coarse, video-level inputs (e.g., text or images), lacking the ability to make fine-grained, frame-level action-conditioned predictions.
Video Diffusion Causalization. Vid2World re-architects the pretrained diffusion backbone so that each frame is generated from past context only. From the architectural perspective, it applies causal masks to the temporal attention layers and explores novel weight transfer mechanisms for the temporal convolution layers. From the training perspective, it samples independent, uniform noise levels for different frames, following Diffusion Forcing, to enable auto-regressive generation. These changes retain the model's learned visual priors while enabling unlimited-horizon, stepwise causal rollout.
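The sketch below illustrates the two causalization ingredients described above. It is a minimal, hedged example rather than the released implementation: the function names, tensor layouts, and discrete noise-level indexing are assumptions made for illustration.

```python
# Minimal sketch (assumed interfaces, not the official code) of:
#  (1) a causal mask over temporal attention, and
#  (2) Diffusion Forcing-style independent per-frame noise levels.
import torch


def causal_temporal_attention(q, k, v):
    """q, k, v: (batch, frames, heads, dim). Each frame attends only to itself
    and to earlier frames, so generation can proceed autoregressively."""
    b, t, h, d = q.shape
    # Lower-triangular (t, t) mask: True where attention is allowed.
    causal_mask = torch.tril(torch.ones(t, t, dtype=torch.bool, device=q.device))
    attn = torch.einsum("bthd,bshd->bhts", q, k) / d ** 0.5
    attn = attn.masked_fill(~causal_mask, float("-inf"))
    attn = attn.softmax(dim=-1)
    return torch.einsum("bhts,bshd->bthd", attn, v)


def sample_per_frame_noise_levels(batch, frames, num_steps, device):
    """Diffusion Forcing-style training: each frame gets its own, independently
    sampled noise level, so at rollout time past frames can be (nearly) clean
    while the newly generated frame starts fully noised."""
    return torch.randint(0, num_steps, (batch, frames), device=device)
```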
Causal Action Guidance. To make the causal model responsive to fine-grained actions, Vid2World extends classifier-free guidance to sequential settings. Each action is encoded with a lightweight MLP and added at its corresponding frame; during training, each action is independently dropped with a fixed probability, forcing the network to learn both conditional and unconditional score functions. At test time, the conditional score function and its unconditional counterpart, computed with the most recent action dropped, are linearly combined with a tunable guidance scale λ, offering test-time flexibility in the responsiveness to fine-grained action variations.
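A minimal sketch of this mechanism is given below, assuming a hypothetical backbone signature `model(x_t, noise_levels, action_embeddings)`; the module and argument names are illustrative rather than the official API.

```python
# Sketch of causal action guidance: an MLP embeds each action, actions are
# independently dropped during training, and at inference the conditional and
# action-dropped scores are linearly combined with a guidance scale.
import torch
import torch.nn as nn


class ActionEncoder(nn.Module):
    def __init__(self, action_dim, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(action_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Learned embedding that replaces a dropped (null) action.
        self.null_embed = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, actions, drop_prob=0.0):
        """actions: (batch, frames, action_dim). Each frame's action is embedded
        and independently dropped with probability drop_prob during training."""
        emb = self.mlp(actions)
        if drop_prob > 0.0:
            drop = torch.rand(emb.shape[:2], device=emb.device) < drop_prob
            emb = torch.where(drop[..., None], self.null_embed, emb)
        return emb  # added to the corresponding frame's features in the backbone


def guided_score(model, x_t, noise_levels, act_emb, act_emb_last_dropped, scale):
    """Classifier-free guidance over the most recent action:
    (1 + scale) * conditional score - scale * score with the last action dropped."""
    cond = model(x_t, noise_levels, act_emb)
    uncond = model(x_t, noise_levels, act_emb_last_dropped)
    return (1.0 + scale) * cond - scale * uncond
```

Setting `scale = 0` recovers the plain conditional prediction, while larger values make the rollout more sensitive to the latest action.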
As a result, Vid2World unifies the high-fidelity synthesis of full-sequence diffusion with the causal reasoning and controllability of interactive world models, delivering precise, action-aware predictions in an auto-regressive manner.
Here we utilize the RT-1 dataset. We provide a list of videos generated by Vid2World alongside the ground-truth videos (denoted as GT). The generated videos are conditioned on the first frame and an action sequence, and are rolled out in an auto-regressive manner.
As shown in the videos, Vid2World is capable of generating predictions that closely resemble the ground truth sequences, both in terms of visual fidelity and dynamic consistency.
To demonstrate our method's capability to aid downstream tasks in interactive environments, we conduct Real2Sim policy evaluation experiments following SIMPLER. Given an initial frame, predefined policies interact with the world model, rolling out trajectories within it (see the sketch below). We use the close_drawer task, where the goal is to close the drawer, and the policies are three checkpoints of RT-1 from different stages of training.
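The following is a hedged sketch of this interactive evaluation loop, assuming hypothetical `policy.act` and `world_model.step` interfaces (not the actual codebase): the policy chooses an action from the latest predicted frame, and the world model advances the trajectory one frame at a time.

```python
# Sketch of a Real2Sim policy-evaluation rollout with assumed interfaces.
def rollout(world_model, policy, init_frame, horizon):
    frames = [init_frame]
    actions = []
    for _ in range(horizon):
        action = policy.act(frames[-1])        # policy reacts to the latest prediction
        next_frame = world_model.step(frames, actions + [action])  # causal one-step prediction
        actions.append(action)
        frames.append(next_frame)
    return frames, actions
```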
As shown in the figure, Vid2World reliably captures and reflects the behavioral differences and performance variations among different policies, demonstrating its capability to distinguish successful and failed executions through autoregressive rollouts.
Here we adopt the celebrated game of CS:GO. We provide a list of videos generated by Vid2World alongside the ground-truth videos (denoted as GT). To make the generated videos comparable to baselines, we provide the first four frames and an action sequence, and the videos are generated in an auto-regressive manner. As demonstrated in the videos, Vid2World generates predictions that closely align with the ground truth sequences, exhibiting high visual fidelity and consistent temporal dynamics.
To showcase our model's capability for counterfactual video generation conditioned on the current action, rather than merely extrapolating trends from past observations, we present the following videos: all trajectories start from the same observation but lead to completely different generated frame sequences because of their different action sequences.
To further validate the effectiveness of our method, we showcase examples of Vid2World's predictions compared to the strong baseline of DIAMOND, under two configurations.
Error Accumulation. A common issue with auto-regressive generation is error accumulation, where errors in early frames are amplified as the generation progresses. Here, we show a comparison between Vid2World and DIAMOND, where the former is able to provide temporally consistent predictions, while the latter accumulates errors, resulting in a progressively blurred video.
Action Alignment. The reliability of a world model depends, to a large extent, on how well its predictions align with the input actions. Vid2World accurately reflects the aim-down-sights action in its predicted video, whereas DIAMOND fails to manifest this action.
Limitations. Despite substantially reducing accumulated error and preserving action alignment, Vid2World still encounters failure cases, as demonstrated in Figure 10, where neither Vid2World nor DIAMOND matches the ground truth. Although model capability is one important factor behind such failures, the environment's inherent randomness, in this case the player's respawn location, also adds to the difficulty.
Here we make use of RECON, a widely adopted open-world navigation dataset. We present a set of videos produced by Vid2World alongside the ground-truth videos (denoted as GT). Following standard baselines, the model is conditioned on the first four frames and the corresponding action sequence, with subsequent frames generated in an auto-regressive fashion. As shown in the results, Vid2World produces predictions that closely match the ground truth, demonstrating both high visual fidelity and consistent temporal coherence.
@article{huang2025vid2world0,
title = {Vid2World: Crafting Video Diffusion Models to Interactive World Models},
author = {Siqiao Huang and Jialong Wu and Qixing Zhou and Shangchen Miao and Mingsheng Long},
year = {2025},
journal = {arXiv preprint arXiv:2505.14357}
}