Synthesizing realistic 3D human movements in a given scene context is critical for building a human-centric understanding of the world. Most recent work has focused on generating 3D motions from text or conditioned on a full 3D scene. Instead, we focus on the task of generating 3D human motions in 2D images. Solving this task successfully would unlock several core capabilities of perception and generation -- such as a sense of scale, affordance, free space, and plausibility -- all from a single image. Our key insight is to build a synthetic dataset by leveraging 2D diffusion models to augment renderings of a 3D scene with a variety of 2D images consistent with it. We design an end-to-end system that goes directly from pixels to human motions without any intermediate representation. Our approach predicts 3D motion in the camera-coordinate system, thereby constraining humans to lie within the image boundaries, and it generates a variety of plausible human motions in unseen images.
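To illustrate why predicting motion in camera coordinates keeps humans tied to the image, here is a minimal sketch (not the authors' code): camera-space 3D joints can be projected with a simple pinhole model and checked against the image bounds. All function names, shapes, and camera parameters below are illustrative assumptions.

```python
# Sketch: project camera-space joints to pixels and verify they stay in frame.
# Everything here is hypothetical; the paper's actual model is not shown.
import numpy as np

def project_to_image(joints_cam, fx, fy, cx, cy):
    """Project camera-space 3D joints (N, 3) to 2D pixel coordinates (N, 2)."""
    x, y, z = joints_cam[:, 0], joints_cam[:, 1], joints_cam[:, 2]
    u = fx * x / z + cx
    v = fy * y / z + cy
    return np.stack([u, v], axis=-1)

def inside_image(pixels, width, height):
    """Return True if every projected joint lies within the image boundaries."""
    u, v = pixels[:, 0], pixels[:, 1]
    return bool(np.all((u >= 0) & (u < width) & (v >= 0) & (v < height)))

# Hypothetical usage: a pose of 24 joints placed 2-4 m in front of the camera.
joints = np.random.uniform([-0.5, -1.0, 2.0], [0.5, 1.0, 4.0], size=(24, 3))
pixels = project_to_image(joints, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
print(inside_image(pixels, width=1280, height=720))
```

Because the motion is expressed in the camera frame, such a projection check is well defined for any input image, which is the property the abstract alludes to.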
@misc{tendulkar2025movi,
title={Populating Images with Moving 3D Humans},
author={Tendulkar, Purva and Sárándi, István and Pons-Moll, Gerard and Vondrick, Carl},
year={2025},
archivePrefix={arXiv},
}