Recently, world models have attracted significant interest from researchers and a broader community of technology enthusiasts, largely due to the viral success of Google’s Genie 3 [1].

Genie 3 - Google’s viral world model

In fact, quite a few successful world models already existed before this, including Meta’s V-JEPA 2 [2], NVIDIA’s Cosmos [3], and others. Since Genie 3, many major companies have released their own world models in quick succession, such as Hunyuan-GameCraft [4], Matrix-Game 2.0 [5], and Yan [6].

V-JEPA 2: Recent World Model from Meta

Personally, I have always been a huge believer in the necessity of world models and have done some related work in this field. It’s undoubtedly exciting to see world models finally entering the public consciousness. However, people tend to overhype emerging technologies at the start and harbor unrealistic expectations. So here I want to share some thoughts on the current state and future development of world models; these represent only my personal views.

V-JEPA 2 Real-World Deployment Demo

1. Do World Models Need Explicit 3D Modeling?

Current industry world models basically follow two fundamental approaches:

  • Pixel-space world models, i.e., action-conditioned video generation. This approach is well studied in academia, building on many years of research, and has achieved considerable success in industry.
  • 3D Mesh-space world models, which have strong connections to 3D Vision. Notable examples include Prof. Fei-Fei Li’s WorldLabs [7] and the academic work TesserAct [8], among others.

Although 3D meshes have various advantages over pixel-space prediction (geometric consistency, temporal consistency, etc.), I believe that in today’s era of abundant video data, learning world models autoregressively from video (whether action-labeled or action-free) has more potential to scale than learning them from relatively scarce 3D data.

However, 3D Mesh world models will continue to exist in specialized scenarios and remain the dominant approach there. For example, in settings that involve depth information or contact-rich embodied environments, 3D representations will remain crucial.

TesserAct: An academic 3D Mesh World Model

2. Will World Models Be the Next Big Thing?

In the past two weeks, there have been many optimistic predictions about the future of world models. The biggest claim (besides humans living in some kind of simulation) is probably that World Models are the Next Big Thing in Generative Models after LLMs.

Selected Twitter Posts

Here I want to explain, from two angles, why we shouldn’t hold excessive expectations for world models.

From a data perspective [9], video data is abundant, but data annotated with action information is scarce. For vision-based data collection schemes, one can even say that the total volume of video data is strictly greater than the volume of video data with action annotations.

Data Pyramid for Robot Learning, Photo Credit: Yuke Zhu

From a learning-objective perspective, what makes world models harder is the Heterogeneity of Action Spaces [10]. Previous successes of generative models (especially in sequence modeling) relied on unified data formats, such as tokens in LLMs, pixels in image/video generation models, and point clouds in 3D space. Action spaces across different embodiments, however, inherently lack such homogeneity. A world model without a unified action space cannot become a ready-to-use foundation model, and more research breakthroughs are needed before a foundational world model across embodiments can be realized.
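
To make this heterogeneity concrete, here is a minimal Python sketch (all names and dimensions are illustrative, not taken from any cited codebase) of how action spaces differ across embodiments, and of the kind of padded, masked "unified action vector" that models like RDT-1B pursue.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Embodiment:
    """Hypothetical description of one robot's action space."""
    name: str
    action_dim: int     # degrees of freedom actually controlled
    control_mode: str   # e.g. joint position, end-effector delta, wheel velocity

# Different embodiments expose structurally different actions.
EMBODIMENTS = [
    Embodiment("single_arm_gripper", action_dim=7,  control_mode="ee_delta + gripper"),
    Embodiment("bimanual_humanoid",  action_dim=26, control_mode="joint_position"),
    Embodiment("quadruped",          action_dim=12, control_mode="joint_torque"),
    Embodiment("mobile_base",        action_dim=3,  control_mode="wheel_velocity"),
]

UNIFIED_DIM = 32  # assumed width of a shared, padded action vector

def to_unified(action: np.ndarray, emb: Embodiment) -> np.ndarray:
    """Pad a native action into a fixed-width vector plus a validity mask."""
    padded = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=np.float32)
    padded[: emb.action_dim] = action
    mask[: emb.action_dim] = 1.0
    return np.concatenate([padded, mask])  # the mask tells the model which slots are real

for emb in EMBODIMENTS:
    a = np.random.uniform(-1, 1, emb.action_dim)
    print(emb.name, to_unified(a, emb).shape)  # every embodiment maps to the same shape (64,)
```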

RDT-1B is an embodied foundation model that learns a Unified Action Space.

Therefore, I believe the Next Big Thing in the coming era will be multi-modal video generation models, with world models as their byproducts for action/language-space control. The success of Genie 3 has already shown that Diffusion Forcing [11] plus action injection modules (e.g., AdaLN [12]), given sufficient data, can achieve stunning visual results. However, at least from my perspective, a more worthwhile research question for the next few years is how to derive world models from existing video generation models [13].
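
As a rough illustration of what "action injection via AdaLN" can look like, here is a minimal PyTorch sketch in the style of DiT-like adaptive layer norm; it is my own simplification, not Genie 3's or any paper's actual architecture. The action embedding regresses per-block scale/shift/gate parameters that modulate the normalized video tokens.

```python
import torch
import torch.nn as nn

class AdaLNActionBlock(nn.Module):
    """One transformer block whose LayerNorms are modulated by an action embedding.

    A minimal sketch of AdaLN-style conditioning (as in DiT), repurposed here to
    inject actions instead of class labels / diffusion timesteps.
    """
    def __init__(self, dim: int, action_dim: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Action embedding -> per-block scale, shift, and gate parameters.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(action_dim, 6 * dim))

    def forward(self, x: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) video tokens; action: (B, action_dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(action).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x

# Usage: tokens for one frame, conditioned on the agent's action at that step.
block = AdaLNActionBlock(dim=256, action_dim=16)
tokens = torch.randn(2, 64, 256)   # (batch, tokens per frame, dim)
actions = torch.randn(2, 16)
out = block(tokens, actions)        # (2, 64, 256)
```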

Vid2World transforms Full-Sequence Video Diffusion Models into Interactive World Models.

3. World Models: Cop or Drop?

Continuing from the previous topic: for a researcher, at this point in time, is diving into world-model research an ideal choice?

The answer varies greatly from person to person. If you believe in the prospects of world models, then go for it! If you don’t believe in them at all (e.g., in their ability to help embodied AI policy learning), then of course don’t pursue such research.

Photo Credit: Nvidia

Here, I want to point out that at this point in time, the design choices for world models have largely converged: algorithmically, Diffusion Forcing (or Self Forcing); architecturally, video generation backbones (e.g., UNet, DiT) plus action modules (e.g., AdaLN) will be the mainstream pattern going forward. So for students pursuing truly challenging topics, world models may have become a relatively mundane field. However, for students with strong engineering capabilities who really want to make things work, this is truly the moment when we can watch world models go from not working to working [14], with very cool visuals, and genuinely scaled-up foundation world models look achievable within the next three years.
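
For readers unfamiliar with Diffusion Forcing, here is a minimal sketch of the core training idea as I understand it (a simplified paraphrase with a toy noise schedule and an assumed `model(noisy, levels, actions)` denoiser interface, not the authors' code): every frame in the sequence gets its own independent noise level, which is what later allows causal, frame-by-frame rollouts.

```python
import torch
import torch.nn.functional as F

def diffusion_forcing_step(model, video_latents, actions, n_levels: int = 1000):
    """One Diffusion Forcing style training step with per-frame noising (sketch).

    video_latents: (B, T, D) latent frames; actions: (B, T, A) per-frame actions.
    `model(noisy, levels, actions)` is an assumed denoiser interface.
    """
    B, T, D = video_latents.shape
    # Key idea: every frame gets its own independent noise level,
    # instead of one shared level for the whole sequence.
    levels = torch.randint(0, n_levels, (B, T), device=video_latents.device)
    alpha = 1.0 - levels.float() / n_levels            # toy linear schedule
    noise = torch.randn_like(video_latents)
    noisy = alpha.sqrt().unsqueeze(-1) * video_latents + \
            (1 - alpha).sqrt().unsqueeze(-1) * noise
    pred_noise = model(noisy, levels, actions)         # predict per-frame noise
    return F.mse_loss(pred_noise, noise)
```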

Simulated Rollouts from UniSim, an academic pixel-space world model.

Personally, I believe that the following directions in world modeling will be very much worth pursuing in the next few years:

3.1. How to deploy World Models to the Physical World, i.e., helping Embodied AI Policy Learning through World Models.

This will be detailed in the next section.

Google DeepMind’s Gemini Robotics model can perform complex real-world tasks.

3.2. How to extend World Models to Long Sequences, achieving Minute-level Temporal Memory / Consistency.

Although Genie 3’s blog [1] mentions that this temporal consistency is an emergent behavior, relying solely on data-driven approaches to achieve consistent memory is unrealistic. While pioneering research on long sequences has already started [15], I believe we might need an SSM-style hidden state or some kind of memory retrieval; personally, I don’t think this is a problem that can be fully solved just by scaling up data.
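
To illustrate the memory-retrieval option, here is a hypothetical sketch in the spirit of pose-keyed memory (my own toy construction, not WorldMem's actual implementation): store past latent frames keyed by camera pose, retrieve the nearest few when generating the next frame, and feed them to the model as extra conditioning.

```python
import torch

class PoseKeyedMemory:
    """Toy memory bank: stores (camera pose, latent frame) pairs and retrieves
    the k entries whose poses are closest to the current pose."""
    def __init__(self, k: int = 4):
        self.k = k
        self.poses, self.frames = [], []

    def write(self, pose: torch.Tensor, frame: torch.Tensor):
        self.poses.append(pose)
        self.frames.append(frame)

    def read(self, pose: torch.Tensor) -> torch.Tensor:
        if not self.poses:
            return torch.empty(0)
        dists = torch.stack([torch.norm(p - pose) for p in self.poses])
        idx = torch.topk(-dists, k=min(self.k, len(self.poses))).indices
        return torch.stack([self.frames[i] for i in idx])  # (k, D) retrieved memories

# Hypothetical rollout loop: condition each step on retrieved long-term context.
# memory = PoseKeyedMemory()
# for t in range(horizon):
#     retrieved = memory.read(current_pose)
#     next_frame = world_model(recent_frames, action, retrieved)
#     memory.write(current_pose, next_frame)
```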

WorldMem learns Long-term Consistent World Simulation with Memory.

3.3. World models that Integrate Multi-modal Signals.

Current world models are simulation systems that rely entirely on sensorimotor information. However, Google [16] tells us that powerful generalist policy models already contain good-enough world models. Language, as the only modality that natively expresses abstract information, bridges high-level abstract knowledge and low-level sensory knowledge, and is an indispensable component of future world models. How to integrate LLMs/MLLMs into the world model framework and bring their rich world knowledge into existing systems [17] is a very interesting direction.

Reasoning for Language Models is Planning for Embodied Agents, Photo Credit: Zhiting Hu.

3.4. Making World Models truly Real-Time.

A natural problem with data-driven methods is high inference latency due to model complexity. If world models are to a) become truly playable Neural Game Engines [18][19] or b) help embodied AI in an online manner, then accelerating world model inference [20] is a crucial step. This is a joint effort spanning hardware acceleration, algorithmic innovation, and the broader community ecosystem.
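
A quick back-of-the-envelope calculation (all numbers are assumptions, not measurements) shows why the number of denoising steps dominates the real-time question: the entire sampling stack has to fit inside a per-frame budget of roughly 33 ms at 30 FPS.

```python
# Illustrative latency budget for a real-time world model (numbers are assumptions).
target_fps = 30
frame_budget_ms = 1000 / target_fps          # ~33 ms per generated frame

per_step_ms = 20                             # assumed cost of one denoising pass
for steps in (50, 8, 4, 1):
    latency = steps * per_step_ms
    verdict = "fits" if latency <= frame_budget_ms else "too slow"
    print(f"{steps:>2} denoising steps -> {latency:4.0f} ms per frame ({verdict})")
# Under these assumptions only 1-step (distilled) sampling meets the 30 FPS budget,
# which is why step distillation / Self Forcing style methods matter for real time.
```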

Making World Models real-time is essential for human entertainment purposes, Photo Credit: Xun Huang.

3.5. Multi-Agent World Models.

Currently, all the world models we see are single-agent world models. However, if we want a Neural Game Engine that supports multiplayer games, exploring world models in multi-agent scenarios is an overlooked direction. Simply concatenating each player’s action space leads to data requirements that grow exponentially with the number of players (to achieve the same action-space coverage). How to learn a Multi-Agent, or even Variable-Agent, World Model data- and parameter-efficiently would be a very interesting exploration.
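
A toy calculation makes the scaling problem explicit (assuming a discretized per-agent action set; the numbers are purely illustrative): under naive concatenation, the joint action space grows as |A|^N in the number of agents N.

```python
# Illustrative only: joint action space size under naive concatenation.
actions_per_agent = 20        # assumed size of one agent's discretized action set
for n_agents in (1, 2, 4, 8):
    joint = actions_per_agent ** n_agents
    print(f"{n_agents} agents -> |joint action space| = {joint:,}")
# 1 agent  -> 20
# 2 agents -> 400
# 4 agents -> 160,000
# 8 agents -> 25,600,000,000
# Covering the joint space at the same per-action density therefore requires
# exponentially more interaction data, motivating factored / parameter-shared designs.
```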

Bimanual Operation is a good place to start for multi-agent.

4. What Do World Models Mean for Embodied AI?

People who pay attention to world models can be roughly divided into two groups. One group comes from Computer Vision and wants to create very cool visual effects and ultimately revolutionize industries like games, simulation, and rendering. The other group comes from Embodied AI: some are veterans of the model-based RL era who have always believed in this view, while others are newcomers who expect world models to be the game-changer that breaks the data bottleneck of Embodied AI.

World models helping embodied AI is progress we can reasonably expect. However, if we assume that VLAs [21][22][23] (though not necessarily the current VLA architectures) will be the form embodied AI foundation models take, then in the ecosystem that emerges around VLA, what form will world models take?

State-of-the-art VLA models (like $\pi_0$) are seen as the next paradigm shift for robot learning.

My view is that world models will exist in embodied AI as foundation models, but they won’t be powerful enough to replace real-world imitation learning, and will only replace the role of simulators in some scenarios [24].

Simulators and World Models are two ways of modeling the physical world.

First, why should embodied world models exist as foundation models? Because a) it is technically feasible, and b) world models that are not foundation models offer little value. World models are trained on data, and training a world model from scratch that can help policy learning requires more data than training an imitation-learning policy directly. Therefore, for tasks that don’t require generalization, we don’t need world models. The real promise of world models lies in scenarios requiring generalization, where we can zero-shot/few-shot adapt a pre-existing world model to the scenario; that is what can truly fulfill the promise of “breaking the data bottleneck of embodied AI”.

VLA Models show promise of generalizing out-of-distribution.

Furthermore, we need to clearly recognize the limitations of world models, which closely mirror the limitations of simulation data. In scenarios whose dynamics lack strong human priors (such as some natural-science areas or under-studied real-world dynamical systems), data-driven methods (world models) may perform better than prior-driven, hard-coded methods. However, in the vast majority of specific embodied AI tasks, the performance of world models is effectively upper-bounded by simulators designed specifically for that scenario. The extent to which simulated data helps policy learning is highly scenario-dependent, and in contact-rich, dexterous scenarios requiring tactile sensing, world models may prove hardly useful, if not totally useless.

Where Dexterous Control succeeds, World Models shall fail. Photo Credit: Sharpa.

Personally, I believe the next era of embodied AI should revolve around a Generalist Policy Model. World models may combine with general policy models in various ways (embedded within them or attached to them), but the next Embodied AI era is unlikely to revolve around world models.

Helix is a VLA model for humanoid full-upper-body control.

5. Prior-Driven vs Data-Driven: What Role Does Physics Integration Play in World Models?

Human priors and data-driven approaches are two technical routes that have long coexisted, dating back to the Computer Vision era. In the context of dynamics modeling, prior-driven means simulators, whereas data-driven means world models. I think researchers who are still uncertain about this topic should reread Rich Sutton’s The Bitter Lesson [25]. Given sufficient data, data-driven methods will definitely win. But in specific task scenarios, squeezing out performance by introducing priors will remain effective for a long time. At its core, this is a fundamental tradeoff between generalization and performance, one that the basic principles of statistical learning tell us we cannot have both ways.

Physics-Based vs Data-Driven is a fundamental tradeoff.

Therefore, my personal view is that learning general models through physics-informed methods is a completely wrong technical route. For general models, physical accuracy/consistency is an emergent ability brought about by increased data volume.

Snapshot of The Bitter Lesson, written by Rich Sutton.

6. How Do I View JEPA-style World Models?

When it comes to world models, an unavoidable topic is Yann LeCun and the JEPA architecture [26] he advocates. Although many of Yann LeCun’s public claims are unreliable (this is actually normal; Hinton thought spiking neural networks [27] would take off, but they didn’t), some of the ideas behind JEPA are quite reasonable and very profound. The final form of world models may not be a JEPA-style architecture, but JEPA’s ideas (e.g., learning in latent space) will definitely remain a source of continuous inspiration.

JEPA, proposed by Yann LeCun, is a well-known paradigm for representation learning.

As a matter of fact, most current video generation/world models already operate in latent space. The mainstream paradigm is to use a near-lossless compression model (e.g., Stable Diffusion’s VAE [28]) as the Encoder and Decoder, then learn a Predictor in that latent space. Near-lossless Encoders lower computational cost, but such an Encoder-Predictor combination is not necessarily optimal. If we replace this Encoder with DINOv2, which extracts stronger semantic information, we can obtain world models that are more valuable for planning [29][30].

DINO-WM is a World Model that predicts in Latent Space.

In fact, what we need is an Encoder-Predictor pair adapted to the task itself, so training the Encoder and Predictor together makes a lot of intuitive sense. JEPA’s approach of placing the loss in feature space can be understood as constructing a “game” highly symmetric to reinforcement learning’s actor-critic, which could potentially learn richer latents. However, the uncertainty here is large: after all, GANs [31] are not the most effective method among current generative models, theory and intuition can only take us so far, and more experimental evidence is needed.
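
For concreteness, here is a minimal PyTorch sketch of training an Encoder and Predictor together with the loss placed in feature space, JEPA-style. This is my own simplification: the EMA target encoder is one common trick to avoid representation collapse, loosely analogous to target networks in actor-critic RL, and none of the module names come from any specific paper.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentWorldModel(nn.Module):
    """Toy JEPA-style world model: predict the next frame's features, not pixels."""
    def __init__(self, obs_dim: int, action_dim: int, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.GELU(), nn.Linear(256, latent_dim))
        self.predictor = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.GELU(), nn.Linear(256, latent_dim))
        # EMA copy of the encoder provides prediction targets (no gradients flow into it).
        self.target_encoder = copy.deepcopy(self.encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update_target(self, tau: float = 0.995):
        for p, tp in zip(self.encoder.parameters(), self.target_encoder.parameters()):
            tp.mul_(tau).add_((1 - tau) * p)

    def loss(self, obs_t, action_t, obs_tp1):
        z_t = self.encoder(obs_t)
        z_pred = self.predictor(torch.cat([z_t, action_t], dim=-1))
        with torch.no_grad():
            z_target = self.target_encoder(obs_tp1)
        return F.mse_loss(z_pred, z_target)   # loss lives in feature space, not pixel space

# Usage sketch
model = LatentWorldModel(obs_dim=512, action_dim=8)
obs_t, act_t, obs_tp1 = torch.randn(16, 512), torch.randn(16, 8), torch.randn(16, 512)
loss = model.loss(obs_t, act_t, obs_tp1)
loss.backward()
model.update_target()
```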

Screenshot from the movie Oppenheimer, directed by Christopher Nolan.

Finally, I’ll share my slides about JEPA and world models, using the introduction of Back to the Features: DINO as a Foundation for Video World Models [32] as a starting point to review the basic ideas of JEPA and several important papers.

The original blog post was written in Chinese in Aug. 2025 (see original post here), and as it turns out, a JEPA language model had already arrived [33] by Oct. 2025.

Citation

Please cite this work as:

Huang, Siqiao. "Beyond the Hype: How I See World Models Evolving in 2025". Nemo's Blog (Oct 2025). https://knightnemo.github.io/blog/posts/wm_2025/

Or use the BibTex citation:

@article{huang2025beyond,
  title = {Beyond the Hype: How I See World Models Evolving in 2025},
  author = {Huang, Siqiao},
  journal = {knightnemo.github.io},
  year = {2025},
  month = {Oct},
  url = "https://knightnemo.github.io/blog/posts/wm_2025/"
}

References


  1. Jack Parker-Holder and Shlomi Fruchter. Genie 3: A new frontier for world models, 2025. URL https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/

  2. Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, Nicolas Ballas. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning, 2025.

  3. NVIDIA. Cosmos World Foundation Model Platform for Physical AI, 2025.

  4. Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, Qinglin Lu. Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition, 2025.

  5. Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, Baixin Xu, Hao-Xiang Guo, Kaixiong Gong, Cyrus Wu, Wei Li, Xuchen Song, Yang Liu, Eric Li, Yahui Zhou. Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model, 2025.

  6. Yan Team. Yan: Foundational Interactive Video Generation, 2025.

  7. WorldLabs. WorldLabs blog, 2024. URL https://www.worldlabs.ai/blog, Last accessed on 2025-07-08.

  8. Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan. TesserAct: Learning 4D Embodied World Models, 2025.

  9. Yuke Zhu. The Data Pyramid for Building Generalist Agents, 2022.

  10. Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, Jun Zhu. RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation, 2024.

  11. Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann. Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion, 2024.

  12. William Peebles, Saining Xie. Scalable Diffusion Models with Transformers, 2022.

  13. Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long. Vid2World: Crafting Video Diffusion Models to Interactive World Models, 2025.

  14. Sherry Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Leslie Kaelbling, Dale Schuurmans, Pieter Abbeel. Learning Interactive Real-World Simulators, 2023.

  15. Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, Xingang Pan. WORLDMEM: Long-term Consistent World Simulation with Memory, 2025.

  16. Jonathan Richens, David Abel, Alexis Bellot, Tom Everitt. General agents contain world models, 2025.

  17. Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, Zhiting Hu. Reasoning with Language Model is Planning with World Model, 2023.

  18. Dani Valevski, Yaniv Leviathan, Moab Arar, Shlomi Fruchter. Diffusion Models Are Real-Time Game Engines, 2024.

  19. Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, François Fleuret. Diffusion for World Modeling: Visual Details Matter in Atari, 2024.

  20. Xun Huang. Towards Video World Models, 2025.

  21. Gemini Robotics Team et al. Gemini Robotics: Bringing AI into the Physical World, 2025.

  22. Physical Intelligence: Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, Ury Zhilinsky. π0: A Vision-Language-Action Flow Model for General Robot Control, 2024.

  23. Figure AI. Helix: A Vision-Language-Action Model for Generalist Humanoid Control, 2025. URL https://www.figure.ai/news/helix

  24. Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, Wei Li, Wei Yin, Yao Yao, Jia Pan, Qiu Shen, Ruigang Yang, Xun Cao, Qionghai Dai. A Survey: Learning Embodied Intelligence from Physical Simulators and World Models, 2025.

  25. Rich Sutton. The Bitter Lesson, 2019.

  26. Yann LeCun. A Path Towards Autonomous Machine Intelligence, 2022.

  27. Wolfgang Maass. Networks of spiking neurons: The third generation of neural network models, 1997.

  28. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models, 2021.

  29. Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto. DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning, 2024.

  30. Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis. DINO-Foresight: Looking into the Future with DINO, 2024.

  31. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. Generative Adversarial Networks, 2014.

  32. Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, Piotr Bojanowski. Back to the Features: DINO as a Foundation for Video World Models, 2025.

  33. Hai Huang, Yann LeCun, Randall Balestriero. LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures, 2025.