3DGSIM: Learning 3D-Gaussian Simulators from RGB Videos

University of Tuebingen, Max Planck Institute for Intelligent Systems

Our 3DGSim model simulates complex dynamics using only multi-view RGB videos. It represents scenes with 3D Gaussian particles, each carrying its own latent feature vector, and relies solely on our TEM-PTV3 transformer, with no hand-crafted priors.

Abstract

We introduce 3DGSim, a fully end-to-end 3D physics simulator. It is trained on multi-view videos to ensure both spatial and temporal consistency, without relying on the inductive biases or ground-truth 3D supervision that can impede scalability and generalization.

It encodes images into a 3D Gaussian particle representation, utilizes a transformer to propagate dynamics, and renders frames using 3D Gaussian splatting. By jointly training inverse rendering with a dynamics transformer using a temporal encoding and merging layer, 3DGSim embeds physical properties into point-wise latent vectors without enforcing explicit connectivity constraints.
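At a high level, one simulator step operates on a set of latent-augmented particles. The following minimal NumPy sketch illustrates that data flow only: `dynamics_step` and its weight matrix `W` are toy stand-ins for the actual TEM-PTV3 transformer, and the encoder and splatting renderer are omitted entirely.

```python
import numpy as np

def dynamics_step(positions, latents, W):
    """Toy stand-in for the dynamics transformer: each particle's
    displacement is predicted from its own latent feature vector."""
    delta = np.tanh(latents @ W)          # (N, 3) predicted displacements
    return positions + delta, latents

# Scene state: N Gaussian particles, each with a 3D mean and a latent vector.
N, D = 256, 16
rng = np.random.default_rng(0)
positions = rng.normal(size=(N, 3))       # Gaussian means
latents   = rng.normal(size=(N, D))       # per-particle latent features
W         = rng.normal(size=(D, 3)) * 0.01  # toy "learned" weights

# Roll the state forward; a splatting renderer would consume
# (positions, latents) at every step to produce the frame.
trajectory = [positions]
for _ in range(3):
    positions, latents = dynamics_step(positions, latents, W)
    trajectory.append(positions)
```

The key point the sketch captures is that no connectivity structure (mesh, graph edges) is maintained: each step maps one unordered particle set to the next.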

This enables the model to capture diverse physical behaviors, from rigid to elastic and cloth-like interactions, along with realistic lighting effects that also generalize to unseen multi-body interactions and novel scene edits.

Video

Datasets

Alongside 3DGSim, we introduce three challenging datasets, each addressing distinct physical interactions and deformation characteristics.


In the Cloth dataset, the cloth is anchored at four corners, challenging the model to infer the implicit constraints and to model the dynamic deformations characteristic of cloth-like materials.

Results

Quantitative Results

Quantitative evaluation on our three datasets comparing 3DGSim with baseline methods.

Dataset   Method          PSNR (↑)        SSIM (↑)      LPIPS (↓)
Elastic   3DGSim (4-12)   33.15 ± 3.51    0.97 ± 0.02   0.02 ± 0.01
          CosmosFT        26.50 ± 5.21    0.82 ± 0.02   0.067 ± 0.030
          Cosmos          18.87 ± 3.99    0.79 ± 0.08   0.23 ± 0.08
Rigid     3DGSim (4-12)   28.28 ± 2.52    0.90 ± 0.03   0.09 ± 0.03
          CosmosFT        26.44 ± 2.26    0.68 ± 0.05   0.104 ± 0.028
          Cosmos          22.35 ± 3.82    0.83 ± 0.08   0.24 ± 0.08
Cloth     3DGSim (4-8)    26.98 ± 2.63    0.89 ± 0.03   0.08 ± 0.03
          CosmosFT        22.49 ± 0.99    0.73 ± 0.03   0.141 ± 0.038
          Cosmos          21.10 ± 3.56    0.86 ± 0.06   0.19 ± 0.06

*CosmosFT refers to LoRA fine-tuning of the Cosmos-Predict2 2B model on the respective dataset.
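The PSNR figures above follow the standard definition; a minimal NumPy sketch for images scaled to [0, 1] (SSIM and LPIPS require their own reference implementations and are not shown):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return np.inf                       # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform pixel error of 0.1 gives MSE = 0.01, i.e. PSNR ≈ 20 dB.
target = np.zeros((8, 8))
score = psnr(target + 0.1, target)
```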

Generalizations

Editing Scenes

A key advantage of 3DGSim is its 3D representation of the simulator’s state, enabling direct scene editing for modular construction, counterfactual reasoning, and scenario exploration.
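Because the state is an explicit particle set, such edits reduce to array operations. The sketch below is purely illustrative (the `(N, 3)` position / `(N, D)` latent layout and the `duplicate_object` helper are assumptions, not the released API): it duplicates one object's particles and shifts the copy.

```python
import numpy as np

def duplicate_object(positions, latents, mask, offset):
    """Copy the particles selected by `mask`, shift the copy by `offset`,
    and append the new position and latent rows to the scene state."""
    new_pos = positions[mask] + offset
    new_lat = latents[mask]            # latents travel with their particles
    return (np.concatenate([positions, new_pos]),
            np.concatenate([latents, new_lat]))

# Example: duplicate the first 100 particles, shifted 0.5 units along x.
rng = np.random.default_rng(1)
positions = rng.normal(size=(256, 3))
latents   = rng.normal(size=(256, 16))
mask = np.zeros(256, dtype=bool)
mask[:100] = True
positions, latents = duplicate_object(
    positions, latents, mask, np.array([0.5, 0.0, 0.0]))
```

Since the dynamics transformer operates on the particle set with no fixed connectivity, the edited scene can be rolled forward directly.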

Generalization to Multi-Object Simulations

Despite being trained only on object-ground collisions, 3DGSim correctly captures realistic multi-body dynamics. Instead of collapsing into chaotic interactions, individual objects retain structural integrity and move cohesively.

Learning Shadows as Part of Dynamics

A striking consequence of removing explicit physics biases is that 3DGSim not only captures physics but also learns to reason about broader scene properties, such as shadows.

Supplementary Material

Additional comparisons and analysis with Cosmos baseline

Cosmos Results

* Cosmos results are produced by running the LoRA fine-tuned Cosmos-Predict2 2B model on each view separately.

Cosmos Generalizations

For counterfactual probing, we edit the 2D images by either removing the ground or duplicating objects. Since this process is time-intensive, only a limited number of examples are provided.

BibTeX

@article{zhobro20253dgsim,
  author  = {Mikel Zhobro and Andreas René Geist and Georg Martius},
  title   = {3DGSim: Learning 3D-Gaussian Simulators from RGB Videos},
  journal = {arXiv},
  year    = {2025},
  eprint  = {2503.24009},
  url     = {https://arxiv.org/abs/2503.24009}
}