Learning Character-Agnostic Motion for Motion Retargeting in 2D


Analyzing human motion is a challenging task with a wide variety of applications in computer vision and in graphics. One such application, of particular importance in computer animation, is the retargeting of motion from one performer to another. While humans move in three dimensions, the vast majority of human motions are captured using video, requiring 2D-to-3D pose and camera recovery, before existing retargeting approaches may be applied. In this paper, we present a new method for retargeting video-captured motion between different human performers, without the need to explicitly reconstruct 3D poses and/or camera parameters. In order to achieve our goal, we learn to extract, directly from a video, a high-level latent motion representation, which is invariant to the skeleton geometry and the camera view. Our key idea is to train a deep neural network to decompose temporal sequences of 2D poses into three components: motion, skeleton, and camera view-angle. Having extracted such a representation, we are able to re-combine motion with novel skeletons and camera views, and decode a retargeted temporal sequence, which we compare to a ground truth from a synthetic dataset. We demonstrate that our framework can be used to robustly extract human motion from videos, bypassing 3D reconstruction, and outperforming existing retargeting methods, when applied to videos in-the-wild. It also enables additional applications, such as performance cloning, video-driven cartoons, and motion retrieval.

Motion Retargeting in 2D

Our approach is to extract an abstract latent representation of human motion, agnostic to both character and camera, directly from ordinary video. The extracted motion may then be applied to other, possibly very different, skeletons and/or shown from new viewpoints, both of which can themselves be extracted from other videos.

Decomposing and Re-composing

We train a deep neural network to decompose 2D projections of synthetic 3D data into three latent spaces: motion, skeleton and camera view-angle, which are then shuffled and re-composed to form new combinations.
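The shuffle-and-recompose step can be illustrated with a minimal Python sketch. Everything here is hypothetical and for illustration only: the `Latents` container, the `recombine_all` helper, and the `toy_decoder` stand-in are not taken from the released code, and forming every possible combination is an assumption about how the shuffling might be done. A real encoder and decoder would be trained networks operating on sequences of 2D joint positions.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence

Code = List[float]


@dataclass(frozen=True)
class Latents:
    """Hypothetical container for the three latent codes of one clip."""
    motion: Code
    skeleton: Code
    view: Code


def recombine_all(clips: Sequence[Latents]) -> List[Latents]:
    """Form every (motion, skeleton, view) combination across the clips.

    Assumption: exhaustive mixing; an actual training loop may sample
    only some of these combinations.
    """
    return [
        Latents(m.motion, s.skeleton, v.view)
        for m, s, v in product(clips, repeat=3)
    ]


def toy_decoder(z: Latents) -> Code:
    """Placeholder decoder that simply concatenates the three codes.

    A trained decoder would instead output a temporal sequence of 2D poses.
    """
    return z.motion + z.skeleton + z.view


# Two toy clips with one-dimensional codes, so the mixing is easy to read.
a = Latents(motion=[1.0], skeleton=[10.0], view=[100.0])
b = Latents(motion=[2.0], skeleton=[20.0], view=[200.0])

combos = recombine_all([a, b])

# Retarget the motion of clip a onto the skeleton and view of clip b.
retargeted = toy_decoder(Latents(a.motion, b.skeleton, b.view))
```

With two input clips, this produces eight combinations, two of which reproduce the original clips and six of which are novel mixtures.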

Skeleton and View-Angle Retargeting

Retargeting of similar motion to various skeletons (left) and different view-angles (right), without the need for 3D reconstruction.


Interpolation of view-angle (horizontal axis) and motion (vertical axis).
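A grid like the one described in this caption could be produced as sketched below, assuming that interpolation is done linearly in each latent space. That is an assumption made for illustration, and the names `lerp` and `interpolation_grid` are hypothetical. Each grid cell holds a (motion code, view code) pair that would then be passed, together with a skeleton code, to the decoder.

```python
from typing import List, Tuple

Code = List[float]


def lerp(a: Code, b: Code, t: float) -> Code:
    """Linear interpolation between two codes of equal length, t in [0, 1]."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolation_grid(
    motion_a: Code, motion_b: Code,
    view_a: Code, view_b: Code,
    rows: int, cols: int,
) -> List[List[Tuple[Code, Code]]]:
    """Build a rows x cols grid of (motion, view) code pairs.

    Motion varies along the vertical axis (rows) and view-angle along
    the horizontal axis (columns).
    """
    grid = []
    for r in range(rows):
        t_motion = r / (rows - 1) if rows > 1 else 0.0
        row = []
        for c in range(cols):
            t_view = c / (cols - 1) if cols > 1 else 0.0
            row.append((lerp(motion_a, motion_b, t_motion),
                        lerp(view_a, view_b, t_view)))
        grid.append(row)
    return grid


# Toy one-dimensional codes; real codes would come from trained encoders.
grid = interpolation_grid([0.0], [1.0], [0.0], [2.0], rows=3, cols=3)
```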

Video Performance Cloning

The ability to perform motion retargeting in 2D enables one to use a video-captured performance to drive a novel 2D skeleton, possibly with different proportions. This is done using recent performance cloning techniques, which employ deep generative networks to produce frames showing the appearance of a target actor reenacting the motion of a driving actor.

Motion Retrieval

Using our motion representation, we can search a dataset of videos in-the-wild for motions similar to the one in a query video, with the search being agnostic to the body proportions of the individual and to the camera view-angle.
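Such a search can be sketched as a nearest-neighbor lookup over motion codes. The sketch below assumes each video has already been reduced to a fixed-length motion code and ranks candidates by cosine similarity. The similarity measure, the function names, and the toy database are all assumptions made for illustration, not details given here.

```python
import math
from typing import Dict, List

Code = List[float]


def cosine_similarity(a: Code, b: Code) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: Code, database: Dict[str, Code], top_k: int = 3) -> List[str]:
    """Return the names of the top_k videos whose motion codes best match the query."""
    scored = [(cosine_similarity(query, code), name)
              for name, code in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical database of motion codes keyed by video name.
database = {
    "walk": [1.0, 0.0],
    "run": [0.9, 0.1],
    "jump": [0.0, 1.0],
}
matches = retrieve([1.0, 0.0], database, top_k=2)
```

Because only the motion code is compared, two videos of the same action would match even if the performers have different proportions or are filmed from different angles, provided the encoder has separated those factors.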