The Humanoid Robot Motor Cortex: A Reinforcement Learning Trained Controller

This is an online article from Professor Alan Fern of Oregon State University, dated August 25, titled “The Emerging Humanoid Motor Cortex: An Inventory of RL-Trained Controllers”.

Human coordinated movement does not solely originate from a conscious planning center. When a person decides to cross a room or reach for a cup, it is not the prefrontal cortex (the planner of the brain) that determines how to activate each muscle. Instead, this task falls to the motor cortex—a specialized set of areas in the brain that translates high-level intentions into precise, real-time muscle activations to maintain balance and move smoothly. This division of labor allows the brain’s planning areas to focus on goals and strategies, while the motor cortex handles the execution of fast, responsive actions.

Humanoid robots face the same fundamental challenge. High-level planners can choose goals and trajectories, but they cannot directly manage the millisecond-level joint torques and balance adjustments required to execute these plans in the real world.The Whole-Body Controller (WBC) was born out of necessity. The WBC functions similarly to the human motor cortex, acting as the execution layer for the robot, connecting low-level coordinated movement signals with abstract commands (such as root velocity, joint targets, or keypoint trajectories). The WBC operates in real-time and must also handle unexpected disturbances and the inherent uncertainties of the physical world, ensuring the robot remains stable and responsive.

The analogy to the motor cortex suggests that humanoid AI should include a clear WBC. However, mere analogy is not enough. In an era where painful lessons often lead to the adoption of minimally structured end-to-end architectures, it is worth pondering: is a clear WBC really necessary? Or is an end-to-end architecture sufficient?

Why a clear WBC?

In the past 18 months, a surge of humanoid robot demonstrations and papers has emerged from academia and industry, covering walking, running, dancing, martial arts, vacuuming, grocery handling, package sorting, and more. These systems employ various AI architectures, many of which involve machine learning. So, do these systems truly contain explicit whole-body controllers?

In most cases, the answer is affirmative. Consider a very common type of demonstration where the humanoid robot’s movements are controlled by a remote operation interface or by human motion captured or video-relocalized. In these setups, the entire AI stack is effectively a WBC: its sole task is to generate low-level movement commands that can stably and accurately track the target motion.

Another class of demonstrations involves autonomous task execution, such as package sorting or pick-and-place. Typically, these systems are trained based on human remote operation data, relying on the WBC to produce stable robot movements based on remote operation inputs during data collection. Thus, the WBC is at least involved in the extensive data collection efforts. But what about the execution architecture during testing?

The actual testing architecture may or may not include an explicit WBC. For example, Stanford University’s HumanPlus (https://humanoid-ai.github.io/) hierarchical architecture includes an explicit frozen WBC at the lowest level. The WBC is first trained and then used to collect demonstrations. These demonstration data are used to train higher-level models that replace the human remote operator by autonomously generating inputs to the WBC.

Another architecture, such as Figure’s Helix, is an end-to-end two-tier hierarchical structure where high-level vision-language models operate at a slower control rate, while low-level policies generate movement commands at a faster rate. In these cases, the low-level policy essentially acts as the WBC, although it is trained jointly with the rest of the system rather than as a separate reusable module.

The appeal of end-to-end approaches lies in the fact that, with sufficiently diverse remote operator data, low-level policies can generalize broadly without additional engineering design. But risks accompany this: strategies trained solely on demonstration may lack the stability and robustness against disturbances that a well-trained and tested WBC possesses. For instance, how can such strategies learn to recover from unexpected pushes if these events are not present in the dataset? If a robust WBC is needed to provide these examples during remote operation, why not keep it in the final stack?

Currently, explicitly trained WBCs remain valuable—both as a powerful execution layer and as components that can be independently tested without retraining the entire humanoid AI for improvements. However, we are still in the early stages, and practical experience will ultimately provide more guidance.

Why learn about WBC?

Fixed-base and wheeled robots are inherently stable, and their underlying control can typically rely on standard, model-based methods. High-level planners specify end-effector trajectories, and simple low-level controllers (e.g., inverse kinematics + PD) convert them into smooth joint torques—without worrying about balance issues. In contrast, humanoid robots are inherently unstable bipedal robots. Their low-level controllers must actively manage balance and whole-body coordination to remain stationary, let alone walk or manipulate objects.

Model-based whole-body control (WBC) for humanoid robots has been around for a long time, with its construction techniques primarily revolving around inverse dynamics, reduced-order models, and model predictive control. While these methods have achieved impressive results—most notably with companies like Boston Dynamics (BD) showcasing their engineering prowess—they have not achieved comparability in broader research fields. Aside from a few engineering efforts with significant investment, model-based control has yet to reach the robustness, flexibility, or ease of deployment of learning-based controllers.

This is why learning-based whole-body controllers are becoming increasingly popular—typically trained through reinforcement learning in simulation and then transferred to real hardware. Learning-based whole-body controllers can internalize complex dynamics, balance strategies, and the rapid reactive behaviors required for smooth motion, which are difficult to encode explicitly. The result is a level of performance and generality that traditional methods struggle to achieve without substantial engineering investment.

It is worth mentioning that another reason learning-based methods are favored over model-based methods is that the barrier to entry is lower. Building model-based systems requires deep expertise in control theory, dynamics modeling, and optimization, while learning-based processes can often be constructed without this background. This does not mean that traditional skills become irrelevant, as understanding these fundamentals is still crucial for shaping intuition and diagnosing problems. However, whether humanoid robots can ultimately be “solved” through handcrafted models and controllers is a question worth exploring. The computational complexity of fully functional humanoid robots may exceed what can be captured through analysis, at least in the short term. The real challenge is not to derive the perfect equations but to learn how to train systems that can autonomously master these complexities most effectively.

As shown in the table: a summary of key features of 18 reinforcement learning-trained WBCs

The Humanoid Robot Motor Cortex: A Reinforcement Learning Trained Controller

Reinforcement learning-trained whole-body controller (WBC) components

Recently, there has been a surge in research on learning-based whole-body controllers from simulation to reality. These systems are trained entirely in simulated environments and then transferred to real humanoid robots. This approach now feels less like an experiment and more like the direction of development for the WBC subfield.

To help understand this rapidly evolving field, a list of learning-based WBC components is provided below, categorized by their functionality, architecture, and training strategies:https://docs.google.com/spreadsheets/d/1wPq9GXhw_By0BKTV85FzhE6ibbOe5un_dITvMGQ0Zp4/edit?gid=0#gid=0

To maintain a focused research scope, each WBC in the list meets five criteria:

1. Demonstrated on a real humanoid robot (not just in simulation).
2. Documented in a reasonably detailed technical paper.
3. Trained through deep learning, typically combining reinforcement learning and supervised learning.
4. Controls all actuated joints, rather than delegating body parts to traditional model-based controllers (hands excluded).
5. Serves as a general motion interface rather than being tuned for a single skill or task.

Among the 18 systems, 14 support both locomotion and some form of manipulation (loco-manipulation): Hover; MHC; R2S2; GMT; ExBody2; TWIST; HST (HumanPlus); ExBody; FALCON; ARMOR; VMP; CLONE; OmniH20. The remaining 4 focus on locomotion: H20; SaW; UCB-Loco; VideoMimic; ETH-Loco. These systems still meet the criteria for humanoid WBCs, as even “pure” locomotion requires coordinated whole-body movements, including adjustments of the torso and arms to maintain stability.

The table roughly illustrates data covering key aspects of WBCs. The goal of this article is to provide a simple tutorial on how these reinforcement learning-trained WBCs are constructed and operated, highlighting some trends and gaps that have emerged in the current landscape.

Robot Platforms

First, information about the robot platforms. The trend is clear: among the 18 controllers, 12 run on Unitree hardware: 6 on G1, 5 on H1, and 1 on both G1 and H1. Next is Agility Robotics‘ Digit V3, which appears in 3 entries.

Unitree’s dominance in this list is not surprising. Before the advent of the H1, researchers had little opportunity to access humanoid robots unless they had deep ties with companies or engineered designs themselves. Most commercial humanoid robot developers either do not sell products to academic institutions or focus on strictly controlled pilot projects. In contrast, Unitree’s humanoid robots are easy to order, relatively affordable, and fully functional enough to meet the needs of serious research. While the G1 is priced as low as $16,000, a fully equipped robot may cost between $40,000 and $50,000, roughly equivalent to what many labs invest in a mid-sized GPU server.

Equally important, the G1 and H1 are easy to simulate. Both feature a clean design and widely available physical models, making them easy to integrate into engines like MuJoCo and NVIDIA’s Isaac for simulation-to-reality training. This combination of affordability and digital accessibility has lowered the barrier to humanoid robot research more than ever.

A broader trend is evident: as more low-cost humanoid robots enter the market, the pace of advancements in learning-based whole-body control and more advanced humanoid AI will only accelerate. However, this accessibility will also bring a lot of noisy and uneven quality work, as seen across the entire AI field.

WBC API: Inputs/Outputs

The WBC provides an API to connect high-level motion commands with the robot’s low-level motor control. The figure illustrates the typical layout of the WBC API within the entire robot control stack. The table shows the input and output options of this API.

WBC Outputs

The table shows that each WBC outputs joint position setpoints for the robot’s actuated joints (i.e., target motor positions), with a control rate consistently set at 50 Hz (specified frequency).

These setpoints do not directly drive the motors but are input into a proportional-derivative (PD) controller running at approximately 1000 Hz, which converts the target positions into motor torques. The PD gains are manually tuned and kept constant, providing a fast and predictable layer for torque-level control.

Why adopt this two-stage approach instead of allowing the WBC to output torques directly? Part of the reason lies in the continuity with earlier work—this structure is commonly used in physics-based character animation and early successful cases of learning-based legged motion. But there are also practical engineering reasons:

1. Hardware limitations: Smooth contact-reactive motor control typically requires a driving frequency of about 1000 Hz. Given the onboard computational limitations, running a neural network WBC at that frequency is highly challenging. By using a PD layer as a buffer, the WBC can operate at 50 Hz, allowing for greater computational capacity.

2. Reinforcement learning efficiency: All WBCs in the inventory have been at least partially trained using reinforcement learning, and reducing the decision rate from 1000 Hz to 50 Hz can decrease the number of decisions per episode by a factor of 20. This not only reduces the number of neural network calls during training but also shortens the duration of reinforcement learning, simplifies credit assignment, and accelerates learning speed. The PD layer plays a second, more subtle role: it acts like a shock absorber, smoothing out the noisy and contact-rich dynamic information from the real world before it reaches the WBC. By buffering the WBC from the full impact of these disturbances, the PD controller makes the RL problem easier to solve.

This medium-rate generation of joint setpoints, stabilized by a high-rate PD layer, has become the default template for learning-based humanoid WBCs. Whether this choice will expose limitations, especially in applications requiring precise force control operations, remains to be seen.

Input 1: Robot State

Each WBC takes some representation of the robot’s kinematic and dynamic state as input, but the specific signals vary. All 18 controllers rely at least on proprioceptive data from the robot’s joint encoders and inertial measurement units (IMUs). Specifically, each WBC uses the following information:

Joint angle positions and velocities
Root gravity vector (or equivalently, the robot’s roll and pitch)
Root angular velocity

These quantities can be reliably estimated through onboard encoders and IMUs.

However, simply providing the current state is insufficient to fully capture the robot’s dynamics. Information about acceleration or higher-order motion cannot be inferred from a single snapshot, so WBCs employ several different strategies to recover this information:

1. Proprioceptive history: Many controllers include a window of recent proprioceptive states in their inputs, providing temporal context for inferring acceleration. The length of the history varies by implementation, ranging from 25 to 1 (no history).
2. Action history: Some WBCs also include a history of their own recent outputs/actions (joint setpoints), which helps the policy disambiguate dynamic effects such as momentum or actuator lag. The history window ranges from 25 to 0 (no actions). Among these, 13 WBCs include both proprioceptive and action histories.
3. Internal memory: Two systems use WBC architectures with built-in recurrence (e.g., LSTM), allowing them to retain compressed internal representations of historical data rather than relying entirely on explicit history windows.

Overall, the table showcases various strategies for capturing the robot’s dynamic state. Only one WBC relies entirely on the current proprioceptive state without including actions or temporal context (though this may reflect incomplete descriptions). The other 17 WBCs include state histories, action histories, or both, with 13 using both. This indicates that some form of temporal context is generally important for capturing the dynamics of the robot, but the specific configurations—history length, action inclusion, or reliance on recurrence—are often treated as adjustable hyperparameters and pragmatically chosen based on effective methods discovered in research.

A smaller subset (4 WBCs) also incorporates linear root velocity as part of the state. However, it is well known that estimating linear velocity for legged robots using IMU data is very challenging and noisy. In practical evaluations, these WBCs rely on external motion tracking systems to provide velocity estimates, which poses a significant limitation for real-world deployment.

Two WBCs designed solely for locomotion go further by integrating local terrain height maps to help them adapt to rugged terrain. One height map is derived from prior modeling and combined with motion capture localization; the other height map is reconstructed online using lidar or depth sensing. The remaining WBCs focus on relatively flat terrain, where blind walking suffices.

In fact, 16 of the 18 WBCs are “blind” to the external world, relying entirely on proprioception and delegating any visual or environmental reasoning to higher-level control. This raises an important design question: how much external sensory input should WBCs handle, and for what purposes?

The answer does not seem to be zero. For example, when climbing stairs, should the high-level planner micromanage foot placement, or should the WBC handle basic, “common-sense” step selection if given a height map? Should all collision avoidance tasks (even for local small obstacles) be pushed to the high-level planner? Or should the WBC have enough spatial awareness to make local passive adjustments? The challenge lies in how to endow the WBC with sufficient external perception capabilities to handle local passive decisions while keeping its reasoning shallow, leaving more abstract planning to higher levels.

Input 2: Target Motion Commands

From the perspective of the WBC API, the most important category of input is the target motion commands, which specify the higher-level intentions that the WBC must achieve. These commands can be expressed at very different levels of abstraction, and this choice not only determines what the WBC must compute internally but also which modalities can drive it.

For instance, controllers driven by VR teleoperation benefit from simple interfaces: root velocity and end-effector targets, while the WBC is responsible for filling in the rest of the body’s motion. In contrast, when the goal is to reproduce human motion from motion capture or video, the command input must be richer—often requiring the specification of full-body joint angles or key body positions to faithfully capture the demonstration. The level of abstraction reflects both the upstream source (teleoperation, planner, motion redirection) and the expected role of the WBC, whether as a task-oriented execution layer or a faithful motion reproducer.

Among the 18 controllers, there are primarily four command modes (many WBCs support multiple modes):

1. Root velocity targets (9 WBCs): This mode supports joystick-like motion control, where the robot’s root is assigned target linear and angular velocities. This abstraction is very valuable as it completely delegates the responsibility of gait generation, balance management, or whole-body posture to the WBC.
2. Root posture targets (11 WBCs): Most operational controllers allow direct specification of reference orientations or inclinations, which are crucial for tasks such as tilting to reach objects or adjusting posture during operations.
3. Full-body posture targets (8 WBCs): These controllers can track full-body configurations represented by body keypoints (pelvis, feet, hands, head) or complete joint angles. Keypoints are particularly convenient for motion capture and video input, as they naturally produce keypoints. However, for WBCs that only accept joint positions, keypoints can be converted to joint positions through inverse kinematics, which adds computational cost upstream.
4. Upper body posture or end-effector targets (6 WBCs): Among controllers that do not track full-body posture, most still support upper body posture or hand targets (via keypoints or joints). Three of these controllers combine this with root velocity control for locomotion, while others implicitly generate motion as needed to achieve upper body targets. From the perspective of providing a high-level manipulation interface, supporting a combination of upper body posture + root velocity may be the best choice in practical applications.

In addition to these modalities, multimodality is also an important feature. While 15 WBCs use fixed command input formats, three controllers allow the command interface to vary: the same WBC can be driven by simple root velocity commands for navigation or by detailed keypoint or joint targets for full-body precision. In contrast, some controllers require full-body posture targets and root velocity to be obtained simultaneously, limiting their utility as purely “joystick” abstractions, as upstream systems must continuously generate full-body targets. From the API perspective, multimodal flexibility is very appealing, as it allows a single WBC to span multiple roles—from coarse executors to detailed reproducers—without retraining or switching between different WBCs.

Regardless of the modality used, all WBCs operate as dense, high-rate trackers, receiving target motion inputs at approximately 50 Hz. They are not designed for autonomously planning movements toward distant targets but rather follow detailed real-time references from upstream sources. Most WBCs only accept the next frame target, but three also handle short-term future trajectories, achieving smoother tracking through motion prediction. However, this target motion preview is only practical when future targets are available, making it unsuitable for real-time teleoperation, as human input arrives frame by frame in real-time teleoperation.

Overall, the choices of these command interfaces largely reflect the types of demonstrations or upstream controls each WBC is intended to support. Task-oriented demonstrations (e.g., moving objects or walking between waypoints) typically rely on low-dimensional inputs such as root velocity and end-effector posture, while motion capture imitation requires high-dimensional full-body targets to capture details. Although multimodal interfaces remain a minority, as WBCs are deployed in broader applications, they are likely to become the norm, where both simple and fine control are essential.

Finally, today’s WBCs delegate all motion planning to higher-level controllers, even for seemingly common-sense actions like reaching for nearby objects. Looking ahead, we may see WBCs evolve to handle sparser, lower temporal density targets, freeing higher-level reasoning layers (e.g., vision-language models) to focus on the semantic aspects of tasks rather than micromanaging routine actions.

All WBCs in the inventory are trained entirely in simulated environments, with the generated policies transferred to real hardware. While the simulation-to-reality transfer involves a series of techniques such as domain randomization and external force perturbations, this article will focus on how the controllers are trained in simulated environments, as most design choices occur in simulation.

Why choose reinforcement learning?

Reinforcement learning (RL) is the foundation for training all WBCs in the inventory, as large-scale supervised data for low-level actions simply does not exist. While humans can demonstrate actions, they cannot directly label the precise sequences of joint setpoints (WBC outputs) required to achieve those actions, which is necessary for supervised learning. However, the quality of actions can be evaluated: if the controller attempts to follow the target commands, the degree of match between the final action and the intention can be measured. Viewing this as a reward signal naturally leads to RL, where the controller learns through trial and error to maximize rewards across multiple simulated scenarios.

Each WBC’s reinforcement learning training process shares a basic structure:

1. Scene generation: Construct distributions of target motion commands, initial robot states, and terrains, then sample scenes where the robot must follow the specified commands.
2. Reward design: Define a reward function that encourages the robot to match the target commands while maintaining balance, smoothness, and energy efficiency.
3. Policy optimization: Train the controller using reinforcement learning algorithms, typically employing proximal policy optimization (PPO) due to its stability and scalability.

The following sections detail each step.

Episode Generation

Among the 14 local manipulation WBCs, 13 utilize human motion capture datasets for scene generation, which are relocalized to the desired robot body. Each scene samples a motion segment, initializes the robot at a random frame, and generates target commands based on that segment. For example, a motion capture segment of a person walking may generate a series of target root velocities or body keypoints for the WBC to track, depending on the selected WBC command mode. The most commonly used dataset (9 WBCs) is AMASS, which contains 40 hours of motion data involving hundreds of subjects and over 11,000 actions. The second most used is the CMU Motion dataset (4 WBCs). Additionally, many WBCs use their own motion capture data and/or augmented data from existing datasets to supplement the existing data.

In contrast, the four pure locomotion WBCs and one local manipulation WBC do not use offline motion capture data for training. Instead, they sample random root velocity commands (e.g., forward speed, turning, sidestepping) and rely on reward shaping to discover stable gaits that can match the velocity commands. Similarly, for local manipulation, random target end-effector positions can be sampled as training commands.

To improve efficiency, most of this training occurs on flat or nearly flat ground. However, at least 2 WBCs mix in procedurally generated terrains, such as ramps, stairs, and uneven surfaces, to enhance terrain robustness.

It remains unclear how to balance offline datasets with synthetically generated ones. Large motion capture datasets like AMASS provide diversity and realism but often lack the specific task motions and reactive behaviors needed for practical applications. In contrast, procedurally generated scenes or custom motion captures can be carefully designed to densely cover the required task space. Ultimately, designing the right training distribution may be the most critical factor in enhancing WBC capabilities.

Reward Design

While the details vary, most WBC reward functions balance three core components:

1. Tracking accuracy: The degree to which the robot’s motion matches the target motion commands. For example, if the target command modality is root velocity, the reward will be maximized when the robot moves at the correct speed. Conversely, if the target is full-body keypoint tracking, the reward is maximized when the actual robot keypoints align perfectly with the target keypoints.
2. Stability and balance: Penalties for falling, excessive tilting, or rapid angular velocities.
3. Smoothness and efficiency: Suppressing jerky movements and excessive torque consumption to promote natural, energy-efficient gaits.

Since bipedal motion is inherently periodic, it is difficult to discover through pure exploration, so 15 WBCs include explicit gait shaping rewards based on techniques from earlier motion studies. For instance, foot airborne time rewards encourage alternating foot contact—ensuring that when one foot lands, the other is in motion, while avoiding prolonged standing poses.

Reward design remains one of the most time-consuming and ad-hoc aspects of WBC training. These multi-component reward functions require careful tuning of weights, but the impact of each choice often only becomes apparent after long and costly reinforcement learning runs. The results are highly sensitive, and due to the lack of principled methods for setting these parameters, reward shaping often feels both like an art and a science. Developing better tools and methods could significantly improve the daily workflows of WBC engineers.

Policy Optimization

All 18 WBCs use PPO (Proximal Policy Optimization) as the core reinforcement learning algorithm. The popularity of PPO partly stems from its stability, scalability, and ease of parallelization, but also from its historical momentum. Many labs have built expertise and infrastructure around PPO for legged motion. While the performance of other reinforcement learning algorithms may match or even exceed that of PPO, PPO remains the mainstay because it runs reliably and has not become a bottleneck in WBC development.

Beyond pure reinforcement learning: Teacher-Student Pipeline

While reinforcement learning is the foundation of all WBC training pipelines, 9 of the 18 controllers use a teacher-student framework to simplify the learning process and enhance robustness. This process is divided into two stages:

1. First, a teacher policy is trained using reinforcement learning, but the teacher policy has access to privileged information that the final controller cannot access (e.g., complete world state estimates, ground truth contacts, or ranges of future motion commands). This information is designed to significantly simplify the reinforcement learning problem, especially for behaviors involving complex contacts or rapid dynamics, where partial observability can slow down or destabilize learning.
2. Then, a student policy is trained through supervised imitation learning to match the actions of the teacher policy, using only the limited observations available at runtime. Most processes adopt DAGGER (Dataset Aggregation)—running the student model in the simulated environment, querying the teacher model for the “correct” action at each state, and training the student model through supervised loss (typically mean squared error) to minimize the deviation from the teacher model.

The teacher-student model pipeline has clear advantages: it decomposes the problem, allowing reinforcement learning to focus on mastering behavior under ideal conditions, while the student model handles the constraints of simulation-to-reality, such as limited perception. However, this introduces additional complexity—both in designing the teacher model and tuning the distillation process. Notably, the same number of WBCs have achieved success using pure reinforcement learning, indicating that the optimal training process still highly depends on the task and team. As WBCs evolve toward greater generality and performance, hybrid models like curriculum design and teacher-student models may play an increasingly significant role.

Network Architecture

Compared to the complex and highly optimized designs common in fields like computer vision or language modeling, the network architectures used for WBCs remain relatively simple and small in scale. Half of the controllers in the inventory (9 out of 18) rely on simple multi-layer perceptrons (MLPs) that directly map the robot’s state and motion commands to joint setpoints.

Nevertheless, some WBCs extend the basic MLP to address issues of temporal context, input compression, or behavioral diversity:

• Variational Autoencoders (VAEs) (2 WBCs): Compress high-dimensional inputs (e.g., joint states or keypoints) into latent space before passing them to the MLP, thereby regularizing learning and filtering noise.
• Recurrent layers (2 WBCs): LSTM-based memory layers provide implicit temporal context, reducing the need for long input histories and helping the controller infer dynamic states such as acceleration and momentum, resulting in smoother motion.
• Mixture of Experts (2 WBCs): A small group of MLP “experts,” each focusing on different motion modes (e.g., walking vs. standing), with a gating network selecting or mixing outputs.
• Transformers or attention layers (3 WBCs): Used to integrate temporal or multimodal inputs, enabling the controller to dynamically weigh past states or key features.

Despite these experiments, the field has yet to reach any deep specialization or carefully designed network architectures for WBCs. Most WBCs remain relatively lightweight, prioritizing inference speed, robustness, and PPO training stability over architectural novelty. Currently, training methods (motion capture segments, teacher-student processes, domain randomization) seem to influence performance more than the network backbone itself. That said, as the functionality and robustness of WBCs continue to improve, new architectures beyond simple MLPs are likely to emerge as preferred standards.

Conclusion

Humanoid AI is still in its early stages, but the forms of whole-body controllers (WBCs) are beginning to emerge, with clear trends in their inputs, training, and architectures. However, it remains challenging to draw definitive conclusions about which approaches “work best”: each WBC relies on extensive tuning, experimentation, random seeds, and adaptations for specific robots. Current comparative results provide limited insights, often reflecting both the implementation work and the underlying designs. Truly understanding which choices are important and which are incidental will take time and collective experimentation. At present, real progress is more likely to come from careful ablation studies and systematic explorations of individual systems rather than chasing state-of-the-art scores, gradually revealing the principles that will shape the next generation of WBCs.

Looking ahead, several significant questions may determine the next steps for WBCs. To what extent should they “observe” the world? Should they continue to function purely as stabilizers, or begin to handle basic functions such as foot placement and collision avoidance, allowing higher-level planners to focus on the global? Multimodality and flexibility have also matured and can make progress: imagine a controller that can seamlessly switch between joystick-like navigation, full-body motion tracking, and even achieving simple long-term goals. This capability will almost certainly propel the field beyond today’s primarily MLP-based networks. Next is the data question: will large motion capture datasets like AMASS continue to dominate? Or can carefully designed synthetic training data perform just as well, or even better? Regardless of the outcome, the answers to these questions will determine the direction of the next wave of humanoid robot control.

References

[1]“Real-World Humanoid Locomotion with Reinforcement Learning”, arXiv 2303.03381, 2023

[2]“Expressive Whole-Body Control for Humanoid Robots”, arXiv 2402.16796, 2024

[3]“Learning Human-to-Humanoid Real-Time Whole-Body Teleoperation”, arXiv 2403.04436, 2024

[4]“Revisiting Reward Design and Evaluation for Robust Humanoid Standing and Walking”, arXiv 2404.19173, 2024

[5]“OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation and Learning”, arXiv 2406.08858, 2024

[6]“Masked Whole-Body Controller for Humanoid Robots”, arXiv 2408.07295, 2024

[7]“VMP: Versatile Motion Priors for Robustly Tracking Motion on Physical Characters”, ACM SIGGRAPH / Eurographics Symposium on Computer Animation (SCA‘24), 2024

[8]“HOVER: Versatile Neural Whole-Body Controller for Humanoid Robots”, arXiv 2410.21229, 2024

[9]“HumanPlus: Humanoid Shadowing and Imitation from Humans”, CoRL’24, 2024

[10]“ExBody2: Advanced Expressive Humanoid Whole-Body Control”, arXiv 2412.13196, 2024

[11]“TWIST: Teleoperated Whole-Body Imitation System”, arXiv 2505.02833, 2025

[12]“Visual Imitation Enables Contextual Humanoid Control”, arXiv 2505.03729, 2025

[13]“FALCON: Learning Force-Adaptive Humanoid Loco-Manipulation”, arXiv 2505.06776, 2025

[14]“Unleashing Humanoid Reaching Potential via Real-world-Ready Skill Space”, arXiv 2505.10918, 2025

[15]“CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks”, arXiv 2506.08931, 2025

[16]“Attention-Based Map Encoding for Learning Generalized Legged Locomotion”, arXiv 2506.09588, 2025

[17]“GMT: General Motion Tracking for Humanoid Whole-Body Control”, arXiv 2506.14770, 2025

[18]“AMOR: Adaptive Character Control through Multi-Objective Reinforcement Learning”, ACM SIGGRAPH 2025

Related posts

Leave a Comment Cancel reply