Populating Images with Moving 3D Humans

Abstract

Synthesizing realistic 3D human movements in a given scene context is critical for building a human-centric understanding of the world. Most recent work has focused on generating 3D motions from text or conditioned on the full 3D scene. Instead, we focus on the task of generating 3D human motions in 2D images. Solving this task would unlock several core capabilities of perception and generation (such as a sense of scale, affordance, free space, and plausibility), all from a single image. Our key insight is to build a synthetic dataset by leveraging 2D diffusion models, so that each rendering of a 3D scene can be augmented with a variety of 2D images consistent with it. We design an end-to-end system that goes directly from pixels to human motions without any intermediate representation. Our approach predicts 3D motion in the camera coordinate system, thereby constraining humans to lie within the image boundaries, and it generates a variety of plausible human motions in unseen images.

Dataset Curation

Given a 3D scene and a human motion in it (from the TRUMANS dataset), we run an optimization to find valid camera positions and orientations such that the human trajectory is fully visible and non-occluded. We then render a depth image from each selected viewpoint (without the human) and pass it as a control signal to ControlNet to obtain a number of realistic-looking RGB images consistent with the depth map. The example shows four camera views, each with its depth image and the corresponding ControlNet-generated images.
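The visibility requirement can be illustrated with a small sketch. This is not the optimization described above: it simply filters a list of candidate viewpoints, keeping those from which every trajectory point lies in front of the camera, projects inside the image, and (optionally) is not behind the rendered scene depth. All function names, the pinhole intrinsics, the z-up world convention, and the `scene_depth_at` callback are assumptions made for illustration only.

```python
import math

# Hypothetical helpers for illustration; none of these names come from MOvI or TRUMANS.

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    norm = math.sqrt(_dot(a, a))
    return tuple(x / norm for x in a)

def look_at_rotation(eye, target, up=(0.0, 0.0, 1.0)):
    """World-to-camera rotation rows (right, down, forward).

    Assumes a z-up world and a viewing direction that is not parallel to `up`.
    """
    forward = _unit(_sub(target, eye))
    right = _unit(_cross(forward, up))
    down = _cross(forward, right)
    return (right, down, forward)

def trajectory_is_visible(trajectory, eye, target, focal, width, height,
                          scene_depth_at=None, eps=0.05):
    """True if every 3D point is in front of the camera, projects inside the
    image, and (if a depth lookup is given) is not behind the scene surface."""
    rotation = look_at_rotation(eye, target)
    for point in trajectory:
        offset = _sub(point, eye)
        x, y, z = (_dot(row, offset) for row in rotation)
        if z <= 0.0:
            return False  # behind the camera
        # Pinhole projection; principal point assumed at the image centre.
        u = focal * x / z + width / 2.0
        v = focal * y / z + height / 2.0
        if not (0.0 <= u < width and 0.0 <= v < height):
            return False  # outside the image boundaries
        if scene_depth_at is not None and z > scene_depth_at(u, v) + eps:
            return False  # scene geometry is closer, so the point is occluded
    return True

def select_cameras(trajectory, candidates, focal=500.0, width=640, height=480):
    """Keep the candidate (eye, target) pairs that see the whole trajectory."""
    return [(eye, target) for eye, target in candidates
            if trajectory_is_visible(trajectory, eye, target, focal, width, height)]

# A short walk along x at 3 m depth, seen by one camera facing it and one facing away.
walk = [(x / 2.0, 3.0, 1.0) for x in range(-2, 3)]
facing = ((0.0, 0.0, 1.0), (0.0, 3.0, 1.0))
facing_away = ((0.0, 6.0, 1.0), (0.0, 9.0, 1.0))
valid = select_cameras(walk, [facing, facing_away])
```

In this toy setup only the camera facing the walk is kept. In a real pipeline the occlusion test would compare against the depth image rendered from the same viewpoint, which is why `scene_depth_at` is passed per camera rather than to `select_cameras`.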

Qualitative Results (Unconditional)

MOvI can generate a variety of plausible human motions in unseen images without any conditioning on the start and/or end position or pose.

BibTeX

@misc{tendulkar2025movi,
  title={Populating Images with Moving 3D Humans},
  author={Tendulkar, Purva and Sárándi, István and Pons-Moll, Gerard and Vondrick, Carl},
  year={2025},
  archivePrefix={arXiv},
}

We introduce MOvI, a framework to generate 3D human motions in image camera-coordinates. MOvI can be used to populate images with moving 3D humans, enabling applications such as image editing, video generation, and human-object interaction.
