Scalable learning of humanoid robots is crucial for their deployment in real-world applications. While traditional approaches primarily rely on reinforcement learning or teleoperation to achieve whole-body control, they are often limited by the diversity of simulated environments and the high costs of demonstration collection. In contrast, human videos are ubiquitous and present an untapped source of semantic and motion information that could significantly enhance the generalization capabilities of humanoid robots.
This paper introduces Humanoid-X, a large-scale dataset of over 20 million humanoid robot poses with corresponding text-based motion descriptions, designed to leverage this abundant data. Humanoid-X is curated through a comprehensive pipeline: data mining from the Internet, video caption generation, motion retargeting of humans to humanoid robots, and policy learning for real-world deployment. With Humanoid-X, we further train a large humanoid model, UH-1, which takes text instructions as input and outputs corresponding actions to control a humanoid robot.
Extensive simulated and real-world experiments validate that our scalable training approach leads to superior generalization in text-based humanoid control, marking a significant step toward adaptable, real-world-ready humanoid robots.
The Humanoid-X dataset consists of 163,800 motion samples, each a tuple \( \langle \mathcal{V}, \mathcal{T}, \mathcal{P}_{human}, \mathcal{P}_{robot}, \mathcal{A}_{robot} \rangle \), curated from Internet videos:
i. We mine massive human-centric video clips \( \mathcal{V} \) from the Internet.
ii. We extract text-based action descriptions \( \mathcal{T} \) and 3D human poses \( \mathcal{P}_{human} \) from the video clips.
iii. We retarget the motions from humans to humanoid robots, resulting in humanoid keypoints \( \mathcal{P}_{robot} \) for high-level control.
iv. We employ reinforcement learning to generate physically deployable humanoid actions \( \mathcal{A}_{robot} \).
In this manner, we obtain the Humanoid-X dataset, which is then leveraged to distill a universal humanoid pose control policy.
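The four pipeline stages above can be summarized by the structure of a single dataset sample. The sketch below is a hypothetical schema (field names, shapes, and the `retarget_frame` helper are illustrative assumptions, not the paper's actual implementation) showing how one tuple \( \langle \mathcal{V}, \mathcal{T}, \mathcal{P}_{human}, \mathcal{P}_{robot}, \mathcal{A}_{robot} \rangle \) might be stored, with a naive per-frame retargeting step:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HumanoidXSample:
    """One Humanoid-X sample <V, T, P_human, P_robot, A_robot> (hypothetical schema)."""
    video_path: str          # V: source video clip mined from the Internet
    caption: str             # T: text-based action description
    human_pose: np.ndarray   # P_human: (frames, J_human, 3) 3D human joint positions
    robot_keypoints: np.ndarray  # P_robot: (frames, J_robot, 3) retargeted keypoints
    robot_actions: np.ndarray    # A_robot: (frames, dof) physically deployable actions

def retarget_frame(human_joints, scale, joint_map):
    """Naive per-frame retargeting sketch: select the human joints that map to
    robot keypoints and rescale for the robot's limb proportions. The actual
    retargeting in the paper optimizes over the robot's kinematics."""
    return np.asarray([human_joints[i] * scale for i in joint_map])
```

In practice the RL stage then refines \( \mathcal{P}_{robot} \) into torque- or position-level actions \( \mathcal{A}_{robot} \) that respect the robot's dynamics.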
UH-1 leverages a Transformer architecture for scalable learning. Humanoid actions are first tokenized into discrete action tokens; the UH-1 Transformer is then trained to take text commands as input and auto-regressively generate the corresponding humanoid action tokens. Given a text instruction \( \mathcal{T} \), UH-1 can either generate high-level humanoid keypoints \( \mathcal{P}_{robot} \) (text-to-keypoint), which a goal-conditioned policy \( \pi \) tracks to control the humanoid robot in closed loop, or generate robot actions \( \mathcal{A}_{robot} \) directly for open-loop control (text-to-action).
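The two core mechanics here, discretizing continuous actions into tokens and decoding them auto-regressively, can be sketched minimally. This is an illustrative sketch, not the UH-1 implementation: the real tokenizer is a learned vector-quantized model, and `next_token_logits` stands in for the trained Transformer:

```python
import numpy as np

def tokenize_actions(actions, codebook):
    """VQ-style tokenization sketch: map each continuous action vector to the
    index of its nearest codebook entry (Euclidean distance). UH-1 learns the
    codebook; here it is given."""
    d = np.linalg.norm(actions[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

def generate_tokens(next_token_logits, text_embedding, max_len, eos_id):
    """Greedy auto-regressive decoding: repeatedly pick the most likely next
    action token conditioned on the text and the tokens generated so far."""
    tokens = []
    for _ in range(max_len):
        logits = next_token_logits(text_embedding, tokens)
        tok = int(np.argmax(logits))
        if tok == eos_id:
            break
        tokens.append(tok)
    return tokens
```

The decoded token sequence is then mapped back through the codebook to recover keypoints \( \mathcal{P}_{robot} \) or actions \( \mathcal{A}_{robot} \) for the robot.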
@article{uh1,
author = {Mao, Jiageng and Zhao, Siheng and Song, Siqi and Shi, Tianheng and Ye, Junjie and Zhang, Mingtong and Geng, Haoran and Malik, Jitendra and Guizilini, Vitor and Wang, Yue},
title = {Learning from Massive Human Videos for Universal Humanoid Pose Control},
journal = {arXiv preprint arXiv:2412.14172},
year = {2024},
}