3DGSIM: Learning 3D-Gaussian Simulators from RGB Videos

Mikel Zhobro, A. René Geist, Georg Martius

University of Tuebingen, Max Planck Institute for Intelligent Systems

Our paper has been accepted at ICML 2026!

Abstract

We introduce 3DGSim, a fully end-to-end 3D physics simulator. It is trained on multi-view videos to ensure both spatial and temporal consistency, all without relying on inductive biases or ground-truth 3D information during training, which can impede scalability and generalization.

It encodes images into a 3D Gaussian particle representation, utilizes a transformer to propagate dynamics, and renders frames using 3D Gaussian splatting. By jointly training inverse rendering with a dynamics transformer using a temporal encoding and merging layer, 3DGSim embeds physical properties into point-wise latent vectors without enforcing explicit connectivity constraints.

This enables the model to capture diverse physical behaviors, from rigid to elastic and cloth-like interactions, along with realistic lighting effects that also generalize to unseen multi-body interactions and novel scene edits.

Multi-view RGB videos are encoded into latent 3D particles, evolved by a Point Transformer dynamics model, and decoded back to images via 3D Gaussian Splatting. The model is trained end-to-end with an image reconstruction loss.
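The encode, evolve, render loop described above can be sketched roughly as follows. This is a minimal illustration only: `encode`, `step`, and `render` are hypothetical placeholders rather than the released 3DGSim API, images are flattened lists of floats, and camera parameters are omitted.

```python
from typing import Callable, List, Sequence

Image = List[float]            # toy stand-in: flattened pixel intensities
Particles = List[List[float]]  # one latent vector per particle


def mse(pred: Image, target: Image) -> float:
    """Mean squared error between two flattened images."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def rollout_loss(
    encode: Callable[[Sequence[Image]], Particles],
    step: Callable[[Particles], Particles],
    render: Callable[[Particles], Image],
    context: Sequence[Image],
    targets: Sequence[Image],
) -> float:
    """Encode context frames, roll the dynamics forward, and score the renders."""
    particles = encode(context)
    total = 0.0
    for target in targets:
        particles = step(particles)               # dynamics model (placeholder)
        total += mse(render(particles), target)   # image reconstruction loss
    return total / len(targets)


# Toy stand-ins so the sketch runs; NOT the real encoder, transformer, or renderer.
def toy_encode(frames: Sequence[Image]) -> Particles:
    return [[pixel] for pixel in frames[-1]]


def toy_step(particles: Particles) -> Particles:
    return [[v + 0.1 for v in p] for p in particles]


def toy_render(particles: Particles) -> Image:
    return [p[0] for p in particles]


loss = rollout_loss(
    toy_encode, toy_step, toy_render,
    context=[[0.0, 0.5]],
    targets=[[0.1, 0.6], [0.2, 0.7]],
)
```

In an actual implementation the three components would be differentiable modules, so that the single image loss can update the encoder and the dynamics model jointly.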

Video

Action conditioning

Actions are just particles.

In 3DGSim, a small set of action particles with learned embeddings is fed into the same TEM‑PTV3 transformer alongside the state particles. There are no special branches and no per‑task heads. Going from a single arm to bimanual manipulation requires zero architectural change: just more action particles.

No particle flow or point correspondences are required for the inputs. State and action particles are just unordered sets fed to the transformer; no tracking, no matching, no per‑point flow supervision.
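To illustrate the idea of treating actions as extra set elements, here is a toy sketch. It is not the TEM‑PTV3 architecture: the attention layer uses identity projections, and the appended type flag is a hypothetical stand-in for the learned action embeddings. It only shows that the same function accepts one or two action particles, and that without positional encodings the output follows any reordering of the input set.

```python
import math
import random


def attention(tokens):
    """Single-head self-attention with identity projections, no positional encoding."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        top = max(scores)                       # subtract max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, tokens)) / norm
            for d in range(dim)
        ])
    return outputs


def build_tokens(state_particles, action_particles):
    """Concatenate both sets; the last entry flags the particle type (assumption)."""
    return ([p + [0.0] for p in state_particles]
            + [a + [1.0] for a in action_particles])


rng = random.Random(0)
state = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
one_arm = [[0.2, 0.0, 0.1]]
two_arms = [[0.2, 0.0, 0.1], [-0.3, 0.1, 0.0]]

tokens_one = build_tokens(state, one_arm)    # 5 state + 1 action particle
tokens_two = build_tokens(state, two_arms)   # 5 state + 2 action particles

out = attention(tokens_two)
out_rev = attention(tokens_two[::-1])        # same set, reversed order
```

Here `out_rev` matches `out` in reversed order up to floating-point rounding, which is why no correspondences between input particles are needed in this toy setting.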

Datasets

As part of 3DGSim we introduce three challenging datasets, each addressing distinct physical interactions and deformation characteristics.

In the Cloth dataset, the cloth is anchored at four corners, challenging the model to infer implicit constraints and to model the dynamic deformations characteristic of cloth-like materials.

Results

Quantitative Results

Quantitative evaluation on our three datasets comparing 3DGSim with baseline methods.

Dataset	Method	PSNR (↑)	SSIM (↑)	LPIPS (↓)
Elastic	3DGSim (4-12)	33.15 ± 3.51	0.97 ± 0.02	0.02 ± 0.01
	CosmosFT	26.50 ± 5.21	0.82 ± 0.02	0.067 ± 0.030
	Cosmos	18.87 ± 3.99	0.79 ± 0.08	0.23 ± 0.08
Rigid	3DGSim (4-12)	28.28 ± 2.52	0.90 ± 0.03	0.09 ± 0.03
	CosmosFT	26.44 ± 2.26	0.68 ± 0.05	0.104 ± 0.028
	Cosmos	22.35 ± 3.82	0.83 ± 0.08	0.24 ± 0.08
Cloth	3DGSim (4-8)	26.98 ± 2.63	0.89 ± 0.03	0.08 ± 0.03
	CosmosFT	22.49 ± 0.99	0.73 ± 0.03	0.141 ± 0.038
	Cosmos	21.10 ± 3.56	0.86 ± 0.06	0.19 ± 0.06

*CosmosFT refers to LoRA fine-tuning of the Cosmos-Predict2 2B model on the respective dataset.
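For readers unfamiliar with the PSNR column, the standard definition is 10 · log10(MAX² / MSE). The sketch below computes it for flattened images with intensities in [0, 1] and summarizes several values as mean and standard deviation. Whether the table aggregates per frame or per sequence, and whether it uses the sample or population standard deviation, is not stated here, so `summarize` is an assumption.

```python
import math
import statistics


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return math.inf                      # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def summarize(values):
    """Mean and sample standard deviation (aggregation scheme is an assumption)."""
    return statistics.mean(values), statistics.stdev(values)


target = [0.6, 0.6, 0.6, 0.6]
pred_a = [0.5, 0.5, 0.5, 0.5]     # per-pixel error 0.1  -> MSE 1e-2
pred_b = [0.59, 0.59, 0.59, 0.59]  # per-pixel error 0.01 -> MSE 1e-4

scores = [psnr(pred_a, target), psnr(pred_b, target)]
mean_psnr, std_psnr = summarize(scores)
```

Because PSNR is logarithmic, a 10× smaller per-pixel error raises the score by 20 dB in this example.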

Generalizations

Here we examine scenarios beyond the training regime, which covered only single-body collisions with the ground.

Editing Scenes

A key advantage of 3DGSim is its 3D representation of the simulator’s state, enabling direct scene editing for modular construction, counterfactual reasoning, and scenario exploration.
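As a rough illustration of what editing an explicit particle state can look like, the sketch below duplicates the particles inside a bounding box and removes those below a height threshold. The `Particle` fields, the box-based selection, and the helper names are hypothetical choices made for this example, not the 3DGSim data format.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Particle:
    position: Vec3              # 3D location of the particle
    latent: Tuple[float, ...]   # point-wise latent vector (contents are illustrative)


def in_box(p: Particle, lo: Vec3, hi: Vec3) -> bool:
    """True if the particle lies inside the axis-aligned box [lo, hi]."""
    return all(l <= x <= h for x, l, h in zip(p.position, lo, hi))


def duplicate_region(scene: List[Particle], lo: Vec3, hi: Vec3,
                     offset: Vec3) -> List[Particle]:
    """Copy every particle inside the box and shift the copies by `offset`."""
    copies = [
        replace(p, position=tuple(x + d for x, d in zip(p.position, offset)))
        for p in scene if in_box(p, lo, hi)
    ]
    return scene + copies


def remove_below(scene: List[Particle], z_min: float) -> List[Particle]:
    """Drop all particles whose height is below `z_min` (e.g. a ground plane)."""
    return [p for p in scene if p.position[2] >= z_min]


ground = [Particle((float(x), float(y), 0.0), (0.0,)) for x in (0, 1) for y in (0, 1)]
obj = [Particle((0.5, 0.5, 1.0), (1.0,)), Particle((0.5, 0.5, 1.2), (1.0,))]
scene = ground + obj

two_objects = duplicate_region(scene, lo=(0.0, 0.0, 0.5), hi=(1.0, 1.0, 2.0),
                               offset=(2.0, 0.0, 0.0))
no_ground = remove_below(two_objects, z_min=0.5)
```

The edited particle set can then be passed to the dynamics model like any other state, since the model consumes an unordered set of particles.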

Generalization to Multi-Object Simulations

Despite being trained only on object-ground collisions, 3DGSim correctly captures realistic multi-body dynamics. Instead of collapsing into chaotic interactions, individual objects retain structural integrity and move cohesively.

Learning Shadows as Part of Dynamics

A striking consequence of removing explicit physics biases is that 3DGSim not only captures physics but also learns to reason about broader scene properties, such as shadows.

Supplementary Material

Additional comparisons and analysis with the Cosmos baseline

Cosmos Results

* Cosmos results are produced by running the LoRA fine-tuned Cosmos-Predict2 2B model on each view separately.

Cosmos Generalizations

In contrast to 3DGSim, CosmosFT fails to predict the object motion when the ground is removed or multiple objects are present.

For counterfactual probing, we edit the 2D images by either removing the ground or duplicating objects. Since this process is time-intensive, only a limited number of examples are provided.

Real-world HOCAP

Qualitative results: 1‑arm and 2‑arm actions.

The same simulator handles single‑arm and bimanual manipulation without changing the architecture.

One arm: single‑hand manipulation

Two arms: bimanual coordination

The only thing that changes between the columns is the number of action particles fed into the simulator.

BibTeX

@article{zhobro20253dgsim,
  author  = {Mikel Zhobro and Andreas René Geist and Georg Martius},
  title   = {3DGSim: Learning 3D-Gaussian Simulators from RGB Videos},
  journal = {arXiv},
  year    = {2025},
  eprint  = {2503.24009},
  url     = {https://arxiv.org/abs/2503.24009}
}