3DGSim: Learning 3D-Gaussian Simulators from RGB Videos

University of Tuebingen, Max Planck Institute for Intelligent Systems

Our paper has been accepted at ICML 2026!

Note: An updated version of the paper (camera-ready, with new real-world results) is coming soon on arXiv.

Abstract

We introduce 3DGSim, a fully end-to-end 3D physics simulator. It is trained on multi-view videos to ensure both spatial and temporal consistency, without relying on inductive biases or ground-truth 3D information during training, both of which can impede scalability and generalization.

It encodes images into a 3D Gaussian particle representation, utilizes a transformer to propagate dynamics, and renders frames using 3D Gaussian splatting. By jointly training inverse rendering with a dynamics transformer using a temporal encoding and merging layer, 3DGSim embeds physical properties into point-wise latent vectors without enforcing explicit connectivity constraints.

This enables the model to capture diverse physical behaviors, from rigid to elastic and cloth-like interactions, along with realistic lighting effects, and to generalize to unseen multi-body interactions and novel scene edits.

[Figure: 3DGSim pipeline. (1) Encoder: an inverse renderer (multi-view depth backbone + FiLM, unprojection) maps the past input views to view-independent particle positions p and features f. (2) Dynamics: the TEM-PTV3 model (t-SPC serialization, temporal merging, patch attention + particle MLP) predicts Δp and f_dyn. (3) Decoder: a Gaussian head with differentiable tile rasterization renders predicted RGB for past and future frames, including novel views. Loss: image reconstruction against the ground truth, L₂ + λ · LPIPS.]

Multi-view RGB videos are encoded into latent 3D particles, evolved by a Point Transformer dynamics model, and decoded back to images via 3D Gaussian Splatting. The model is trained end-to-end on an image reconstruction loss.
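The pipeline figure names the training objective as an image reconstruction loss of the form L₂ + λ · LPIPS. Written out, one plausible reading is the following. The per-pixel averaging in the L₂ term and the symbols Î (rendered image), I (ground-truth image), and N (number of pixels, taken over all supervised views and time steps) are assumptions for illustration; the value of λ is not stated on this page.

```latex
\mathcal{L}(\hat{I}, I) \;=\; \mathcal{L}_2(\hat{I}, I) \;+\; \lambda \cdot \mathrm{LPIPS}(\hat{I}, I),
\qquad
\mathcal{L}_2(\hat{I}, I) \;=\; \frac{1}{N} \sum_{i=1}^{N} \bigl\lVert \hat{I}_i - I_i \bigr\rVert_2^2
```

Because the only supervision is this image-space loss, gradients reach the encoder and the dynamics model only through the differentiable renderer.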

Video

Action conditioning NEW

Actions are just particles.

In 3DGSim, a small set of action particles with learned embeddings is fed into the same TEM‑PTV3 transformer alongside the state particles. There are no special branches and no per‑task heads. Going from a single arm to bimanual manipulation requires zero architectural change: just more action particles.

No particle flow or point correspondences are required for the inputs. State and action particles are just unordered sets fed to the transformer; no tracking, no matching, no per‑point flow supervision.
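The idea that actions are just additional particles in one unordered set can be illustrated with a minimal sketch. Everything named here (`Particle`, `build_token_set`, `toy_set_model`) is hypothetical and not taken from the 3DGSim code; the centroid function merely stands in for the transformer to show that nothing depends on particle order or count.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Particle:
    position: Vec3   # 3D location of the particle
    kind: str        # "state" or "action" (hypothetical type tag)
    time_index: int  # time step the particle belongs to


def build_token_set(state: List[Particle], actions: List[Particle]) -> List[Particle]:
    """Merge state and action particles into one set of tokens (order carries no meaning)."""
    return list(state) + list(actions)


def toy_set_model(tokens: List[Particle]) -> Vec3:
    """Stand-in for the transformer: a permutation-invariant summary (the centroid)."""
    n = len(tokens)
    return tuple(sum(p.position[i] for p in tokens) / n for i in range(3))


state = [
    Particle((0.0, 0.0, 1.0), "state", 0),
    Particle((1.0, 0.0, 1.0), "state", 0),
]
one_arm = [Particle((0.0, 1.0, 0.0), "action", 1)]
two_arms = one_arm + [Particle((2.0, 1.0, 0.0), "action", 1)]

# Same function for both tasks; only the number of action particles differs.
tokens_one_arm = build_token_set(state, one_arm)
tokens_two_arms = build_token_set(state, two_arms)
```

In this sketch, moving from one arm to two changes only the length of the action list; the merging function and the model stand-in are untouched.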

[Figure: n_past state particles (an unordered set, no flow) and n_future action particles are fed into the TEM-PTV3 transformer, which is shared across tasks and outputs the n_future predicted state particles. No flow or correspondences are needed, and the architecture is the same for every task.]

Datasets

As part of 3DGSim we introduce three challenging datasets (Elastic, Rigid, and Cloth), each addressing distinct physical interactions and deformation characteristics.


In the Cloth dataset, the cloth is anchored at its four corners, which challenges the model to infer implicit constraints and to model the dynamic deformations characteristic of cloth-like materials.

Results

Quantitative Results

Quantitative evaluation on our three datasets comparing 3DGSim with baseline methods.

| Dataset | Method        | PSNR (↑)     | SSIM (↑)    | LPIPS (↓)     |
|---------|---------------|--------------|-------------|---------------|
| Elastic | 3DGSim (4-12) | 33.15 ± 3.51 | 0.97 ± 0.02 | 0.02 ± 0.01   |
| Elastic | CosmosFT      | 26.50 ± 5.21 | 0.82 ± 0.02 | 0.067 ± 0.030 |
| Elastic | Cosmos        | 18.87 ± 3.99 | 0.79 ± 0.08 | 0.23 ± 0.08   |
| Rigid   | 3DGSim (4-12) | 28.28 ± 2.52 | 0.90 ± 0.03 | 0.09 ± 0.03   |
| Rigid   | CosmosFT      | 26.44 ± 2.26 | 0.68 ± 0.05 | 0.104 ± 0.028 |
| Rigid   | Cosmos        | 22.35 ± 3.82 | 0.83 ± 0.08 | 0.24 ± 0.08   |
| Cloth   | 3DGSim (4-8)  | 26.98 ± 2.63 | 0.89 ± 0.03 | 0.08 ± 0.03   |
| Cloth   | CosmosFT      | 22.49 ± 0.99 | 0.73 ± 0.03 | 0.141 ± 0.038 |
| Cloth   | Cosmos        | 21.10 ± 3.56 | 0.86 ± 0.06 | 0.19 ± 0.06   |

*CosmosFT refers to LoRA fine-tuning of the Cosmos-Predict2 2B model on the respective dataset.
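For readers unfamiliar with the PSNR column, the sketch below shows the standard definition, PSNR = 10 · log10(MAX² / MSE), and one way to arrive at a "mean ± std" entry. It is an illustration only: the pixel range of [0, 1], the use of the sample standard deviation, and aggregation over per-sequence scores are assumptions, not details confirmed by this page.

```python
import math
from statistics import mean, stdev


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(pred) != len(target):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of scores."""
    return mean(scores), stdev(scores)


# Toy example: a constant error of 0.1 on every pixel gives roughly 20 dB.
example_psnr = psnr([0.0, 0.0, 0.0], [0.1, 0.1, 0.1])
example_summary = summarize([30.0, 32.0, 34.0])
```

Higher PSNR and SSIM indicate closer agreement with the ground truth, while lower LPIPS indicates smaller perceptual distance, which is what the arrows in the table header denote.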

Generalizations

Here we examine scenarios beyond the training regime, which covered only collisions between a single body and the ground.

Editing Scenes

A key advantage of 3DGSim is its 3D representation of the simulator’s state, enabling direct scene editing for modular construction, counterfactual reasoning, and scenario exploration.
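Because the state is an explicit set of 3D particles, edits such as removing the ground or duplicating an object reduce to simple set operations. The sketch below is a hypothetical illustration, not the 3DGSim API: `LatentParticle`, `remove_where`, `duplicate_where`, and the height-based ground test are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class LatentParticle:
    position: Vec3              # 3D location
    feature: Tuple[float, ...]  # latent vector attached to the particle


def remove_where(scene: List[LatentParticle],
                 selected: Callable[[LatentParticle], bool]) -> List[LatentParticle]:
    """Return a new scene without the selected particles."""
    return [p for p in scene if not selected(p)]


def duplicate_where(scene: List[LatentParticle],
                    selected: Callable[[LatentParticle], bool],
                    offset: Vec3) -> List[LatentParticle]:
    """Return a new scene with translated copies of the selected particles appended."""
    copies = [
        replace(p, position=tuple(c + o for c, o in zip(p.position, offset)))
        for p in scene
        if selected(p)
    ]
    return list(scene) + copies


def is_ground(p: LatentParticle) -> bool:
    # Assumption for this toy scene: ground particles lie near z = 0.
    return p.position[2] < 0.05


def is_object(p: LatentParticle) -> bool:
    return not is_ground(p)


scene = [
    LatentParticle((0.0, 0.0, 0.0), (0.1,)),  # ground
    LatentParticle((1.0, 0.0, 0.0), (0.1,)),  # ground
    LatentParticle((0.5, 0.0, 1.0), (0.9,)),  # object
]

no_ground = remove_where(scene, is_ground)
two_objects = duplicate_where(scene, is_object, offset=(2.0, 0.0, 0.0))
```

The edited particle set can then be passed to the dynamics model like any other state, which is what makes counterfactual queries ("what if the ground were missing?") cheap to pose in 3D.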

Generalization to Multi-Object Simulations

Despite being trained only on object-ground collisions, 3DGSim correctly captures realistic multi-body dynamics. Instead of collapsing into chaotic interactions, individual objects retain structural integrity and move cohesively.

Learning Shadows as Part of Dynamics

A striking consequence of removing explicit physics biases is that 3DGSim not only captures physics but also learns to reason about broader scene properties, such as shadows.

Supplementary Material

Additional comparisons and analysis with the Cosmos baseline

Cosmos Results

* Cosmos results are produced by running the LoRA fine-tuned Cosmos-Predict2 2B model on each view separately.

Cosmos Generalizations

In contrast to 3DGSim, CosmosFT fails to predict the object motion when the ground is removed or multiple objects are present.

For counterfactual probing, we edit the 2D images by either removing the ground or duplicating objects. Since this process is time-intensive, only a limited number of examples are provided.

Real-world HOCAP

Qualitative results: 1‑arm and 2‑arm actions.

The same simulator handles single-arm and bimanual manipulation without changing the architecture.

One arm: single-hand manipulation
Two arms: bimanual coordination

The only thing that changes between the columns is the number of action particles fed into the simulator.

BibTeX

@article{zhobro20253dgsim,
      author    = {Mikel Zhobro and Andreas René Geist and Georg Martius},
      title     = {3DGSim: Learning 3D-Gaussian Simulators from RGB Videos},
      journal   = {arXiv},
      year      = {2025},
      eprint    = {2503.24009},
      url       = {https://arxiv.org/abs/2503.24009}
    }