Generation Results of Vid2World. We show the generation results of Vid2World in the videos below.
World models, which predict future transitions from past observation and action sequences, have shown great promise for improving data efficiency in sequential decision-making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their usefulness in complex environments. In contrast, video diffusion models trained on large-scale internet data have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present _Vid2World_, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World systematically explores _video diffusion causalization_, reshaping both the architecture and training objective of pre-trained models to enable autoregressive generation. Additionally, it incorporates a _causal action guidance_ mechanism to enhance action controllability in the resulting interactive world models. Extensive experiments across multiple domains, including robot manipulation, 3D game simulation, and open-world navigation, demonstrate that our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.
Vid2World is a general framework for transforming full-sequence, non-causal, passive video diffusion models into autoregressive, interactive, action-conditioned world models. At its core, it represents a paradigm shift in world modeling: from relying solely on expensive and limited action-labeled data to leveraging internet-scale action-free videos.
Transforming a video diffusion model into an interactive world model demands closing two fundamental gaps: (i) Enabling causal generation—standard VDMs denoise whole sequences with bidirectional context, making them unsuitable for causal rollouts where predictions depend only on past information; (ii) Enforcing action conditioning—VDMs are usually conditioned only on coarse, video-level inputs (e.g., text or images), lacking the ability to make fine-grained, frame-level action-conditioned predictions.
Video Diffusion Causalization. Vid2World re-architects the pretrained diffusion backbone so that each frame is generated from past context only. From the architectural perspective, it applies causal masks to the temporal attention layers and explores novel weight transfer mechanisms for the temporal convolution layers. From the training perspective, it samples independent, uniform noise levels for different frames, following Diffusion Forcing, to enable auto-regressive generation. These changes retain the model's learned visual priors while enabling unlimited-horizon, stepwise causal rollout.
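The sketch below illustrates the two causalization ingredients described above. It is a minimal, hedged example rather than the released implementation: the function names, tensor layouts, and discrete noise-level indexing are assumptions made for illustration.

```python
# Minimal sketch (assumed interfaces, not the official code) of:
#  (1) a causal mask over temporal attention, and
#  (2) Diffusion Forcing-style independent per-frame noise levels.
import torch


def causal_temporal_attention(q, k, v):
    """q, k, v: (batch, frames, heads, dim). Each frame attends only to itself
    and to earlier frames, so generation can proceed autoregressively."""
    b, t, h, d = q.shape
    # Lower-triangular (t, t) mask: True where attention is allowed.
    causal_mask = torch.tril(torch.ones(t, t, dtype=torch.bool, device=q.device))
    attn = torch.einsum("bthd,bshd->bhts", q, k) / d ** 0.5
    attn = attn.masked_fill(~causal_mask, float("-inf"))
    attn = attn.softmax(dim=-1)
    return torch.einsum("bhts,bshd->bthd", attn, v)


def sample_per_frame_noise_levels(batch, frames, num_steps, device):
    """Diffusion Forcing-style training: each frame gets its own, independently
    sampled noise level, so at rollout time past frames can be (nearly) clean
    while the newly generated frame starts fully noised."""
    return torch.randint(0, num_steps, (batch, frames), device=device)
```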
Causal Action Guidance. To make the causal model responsive to fine-grained actions, Vid2World extends classifier-free guidance to sequential settings. Each action is encoded with a lightweight MLP and added at its corresponding frame; during training, each action is independently dropped with a fixed probability, forcing the network to learn both conditional and unconditional score functions. At test time, the conditional score function and its unconditional counterpart, computed with the most recent action dropped, are linearly combined with a tunable guidance scale λ, offering test-time flexibility in the responsiveness to fine-grained action variations.
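A minimal sketch of this mechanism is given below, assuming a hypothetical backbone signature `model(x_t, noise_levels, action_embeddings)`; the module and argument names are illustrative rather than the official API.

```python
# Sketch of causal action guidance: an MLP embeds each action, actions are
# independently dropped during training, and at inference the conditional and
# action-dropped scores are linearly combined with a guidance scale.
import torch
import torch.nn as nn


class ActionEncoder(nn.Module):
    def __init__(self, action_dim, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(action_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Learned embedding that replaces a dropped (null) action.
        self.null_embed = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, actions, drop_prob=0.0):
        """actions: (batch, frames, action_dim). Each frame's action is embedded
        and independently dropped with probability drop_prob during training."""
        emb = self.mlp(actions)
        if drop_prob > 0.0:
            drop = torch.rand(emb.shape[:2], device=emb.device) < drop_prob
            emb = torch.where(drop[..., None], self.null_embed, emb)
        return emb  # added to the corresponding frame's features in the backbone


def guided_score(model, x_t, noise_levels, act_emb, act_emb_last_dropped, scale):
    """Classifier-free guidance over the most recent action:
    (1 + scale) * conditional score - scale * score with the last action dropped."""
    cond = model(x_t, noise_levels, act_emb)
    uncond = model(x_t, noise_levels, act_emb_last_dropped)
    return (1.0 + scale) * cond - scale * uncond
```

Setting `scale = 0` recovers the plain conditional prediction, while larger values make the rollout more sensitive to the latest action.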
As a result, Vid2World unifies the high-fidelity synthesis of full-sequence diffusion with the causal reasoning and controllability of interactive world models, delivering precise, action-aware predictions in an auto-regressive manner.
Here we utilize the RT-1 dataset. We provide a list of videos generated by Vid2World alongside the ground-truth videos (denoted as GT). The generated videos are conditioned on the first frame and an action sequence, and are rolled out in an auto-regressive manner.
As shown in the videos, Vid2World is capable of generating predictions that closely resemble the ground truth sequences, both in terms of visual fidelity and dynamic consistency.
To demonstrate our method's capability to aid downstream tasks in interactive environments, we conduct Real2Sim policy evaluation experiments following SIMPLER. Given an initial frame, predefined policies interact with the world model, rolling out trajectories within it (see the sketch below). We use the close_drawer task, where the goal is to close the drawer, and the policies are three checkpoints of RT-1 from different stages of training.
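The following is a hedged sketch of this interactive evaluation loop, assuming hypothetical `policy.act` and `world_model.step` interfaces (not the actual codebase): the policy chooses an action from the latest predicted frame, and the world model advances the trajectory one frame at a time.

```python
# Sketch of a Real2Sim policy-evaluation rollout with assumed interfaces.
def rollout(world_model, policy, init_frame, horizon):
    frames = [init_frame]
    actions = []
    for _ in range(horizon):
        action = policy.act(frames[-1])        # policy reacts to the latest prediction
        next_frame = world_model.step(frames, actions + [action])  # causal one-step prediction
        actions.append(action)
        frames.append(next_frame)
    return frames, actions
```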
As shown in the figure, Vid2World reliably captures and reflects the behavioral differences and performance variations among different policies, demonstrating its capability to distinguish successful and failed executions through autoregressive rollouts.
Here we adopt the celebrated game of CS:GO. We provide a list of videos generated by Vid2World alongside the ground-truth videos (denoted as GT). To make the generated videos comparable to baselines, we provide the first four frames and an action sequence, and the videos are generated in an auto-regressive manner. As demonstrated in the videos, Vid2World generates predictions that closely align with the ground truth sequences, exhibiting high visual fidelity and consistent temporal dynamics.
To showcase our model's capability for counterfactual video generation conditioned on the current action, rather than merely extrapolating trends from past observations, we present the following videos: all trajectories start from the same observation but lead to completely different generated frame sequences because of their different action sequences.
To further validate the effectiveness of our method, we showcase examples of Vid2World's predictions compared to the strong baseline of DIAMOND, under two configurations.
Error Accumulation. A common issue with auto-regressive generation is error accumulation, where errors in early frames are amplified as the generation progresses. Here, we show a comparison between Vid2World and DIAMOND, where the former is able to provide temporally consistent predictions, while the latter accumulates errors, resulting in a progressively blurred video.
Action Alignment. The reliability of a world model depends, to a large extent, on how well its predictions align with the input actions. Vid2World accurately reflects the aim-down-sights action in its predicted video, whereas DIAMOND fails to manifest this action.
Limitations. Despite substantially reducing accumulated error and preserving action alignment, Vid2World still encounters failure cases, as demonstrated in Figure 10, where neither Vid2World nor DIAMOND matches the ground truth. Although model capability is one important factor behind such failures, the environment's inherent randomness, in this case the player's respawn location, also adds to the difficulty.
Here we make use of RECON, a widely adopted open-world navigation dataset. We present a set of videos produced by Vid2World alongside the ground-truth videos (denoted as GT). Following standard baselines, the model is conditioned on the first four frames and the corresponding action sequence, with subsequent frames generated in an auto-regressive fashion. As shown in the results, Vid2World produces predictions that closely match the ground truth, demonstrating both high visual fidelity and consistent temporal coherence.
@article{huang2025vid2world0,
title = {Vid2World: Crafting Video Diffusion Models to Interactive World Models},
author = {Siqiao Huang and Jialong Wu and Qixing Zhou and Shangchen Miao and Mingsheng Long},
year = {2025},
journal = {arXiv preprint arXiv:2505.14357}
}