
A Game Changer for Robot Policy Learning? Berkeley Proposes the Body Transformer

2024-08-19


Machine Heart Report

Editor: Panda

In the past few years, the Transformer architecture has achieved great success and has spawned a large number of variants, such as the Vision Transformer (ViT), which excels at visual tasks. The Body Transformer (BoT) introduced in this article is a Transformer variant that is particularly well suited to robot policy learning.

We know that when physical agents correct and stabilize their movements, they often respond spatially to the location of the external stimuli they perceive. In humans, for example, the response circuits for these stimuli reside at the level of spinal neural circuits, which drive individual actuators. Corrective local execution is a key factor in efficient movement, and it is especially important for robots.

However, previous learning architectures have typically not modeled the spatial associations between sensors and actuators. Because robot policies reuse architectures developed for natural language and computer vision, they often fail to effectively exploit the structure of the robot's body.

Even so, the Transformer has great potential in this regard. Studies have shown that Transformers can effectively handle long-range sequence dependencies and easily absorb large amounts of data. The Transformer architecture was originally developed for unstructured natural language processing (NLP) tasks such as language translation, in which an input sequence is mapped to an output sequence.

Based on this observation, a team led by Professor Pieter Abbeel at the University of California, Berkeley, proposed the Body Transformer (BoT), which directs attention according to the spatial placement of sensors and actuators on the robot body.



  • Paper title: Body Transformer: Leveraging Robot Embodiment for Policy Learning
  • Paper address: https://arxiv.org/pdf/2408.06316v1
  • Project website: https://sferrazza.cc/bot_site
  • Code address: https://github.com/carlosferrazza/BodyTransformer

Specifically, BoT models the robot body as a graph whose nodes are its sensors and actuators. It then applies a highly sparse mask to the attention layers to prevent each node from attending to anything outside its immediate neighborhood. Stacking multiple BoT layers with the same structure aggregates information across the entire graph without compromising the representational power of the architecture. BoT performs well in both imitation learning and reinforcement learning, and some have even called it a "game changer" for policy learning.

Body Transformer

A robot learning policy that uses the original Transformer architecture as its backbone typically ignores the useful information provided by the robot's body structure. In fact, this structural information can supply the Transformer with a stronger inductive bias. The team exploited this information while retaining the representational power of the original architecture.

The Body Transformer (BoT) architecture is based on masked attention. In each layer of this architecture, a node can only see information about itself and its immediate neighbors. Information therefore flows according to the structure of the graph: earlier layers reason over local information, while later layers aggregate more global information from more distant nodes.
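This layer-by-layer propagation can be checked with a toy calculation: with a mask that allows each node to see only itself and its neighbors, a node's output after L layers can depend only on nodes within graph distance L. The following pure-Python sketch (the 4-node chain body is an illustrative assumption, not a robot from the paper) verifies this via boolean matrix products:

```python
# Toy check of how information propagates through stacked masked-attention
# layers: with mask M = I + A (self + immediate neighbors), information
# reaches at most L hops after L layers.

def bool_matmul(x, y):
    """Boolean matrix product: composes one-layer reachability."""
    n = len(x)
    return [[any(x[i][k] and y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Chain graph 0-1-2-3: adjacency A, then per-layer visibility M = I + A.
n = 4
A = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
M = [[int(i == j) or A[i][j] for j in range(n)] for i in range(n)]

# Two masked layers: node 0 can see nodes within distance 2 (0, 1, 2)
# but not node 3.
reach = bool_matmul(M, M)
print([int(v) for v in reach[0]])  # -> [1, 1, 1, 0]
```

A third masked layer extends node 0's reach to the whole chain, which is why stacking BoT layers recovers global information flow without unmasking any single layer.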



As shown in Figure 1, the BoT architecture consists of the following components:

1. Tokenizer: projects sensor inputs into corresponding node embeddings;

2. Transformer encoder: processes the input embeddings and produces output features of the same dimension;

3. Detokenizer: decodes the features into actions (or values, for critic training in reinforcement learning).

Tokenizer

The team chose to map the observation vector into a graph of local observations.

In practice, they assign global quantities to the root element of the robot body and local quantities to the nodes representing the corresponding limbs, similar to previous GNN methods.

A linear layer then projects each local state vector into an embedding. The state of each node is fed into its own node-specific learnable linear projection, producing a sequence of n embeddings, where n is the number of nodes (the sequence length). This differs from previous work, which typically uses a single shared learnable linear projection to handle varying numbers of nodes in multi-task reinforcement learning.
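The node-specific projections can be sketched as follows. This is a minimal illustration, not the authors' code; the node count and dimensions are assumptions:

```python
# Sketch of the node-wise tokenizer: each node's local observation is
# projected by its own learnable linear layer (one (W, b) pair per node,
# not shared), yielding one embedding per node.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, obs_dim, embed_dim = 5, 8, 16

W = [rng.standard_normal((obs_dim, embed_dim)) for _ in range(n_nodes)]
b = [np.zeros(embed_dim) for _ in range(n_nodes)]

def tokenize(local_obs):
    """local_obs: (n_nodes, obs_dim) -> embeddings: (n_nodes, embed_dim)."""
    return np.stack([local_obs[i] @ W[i] + b[i] for i in range(n_nodes)])

obs = rng.standard_normal((n_nodes, obs_dim))
tokens = tokenize(obs)
print(tokens.shape)  # -> (5, 16)
```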

BoT Encoder

The backbone network used by the team is a standard multi-layer Transformer encoder, and there are two variants of this architecture:

  • BoT-Hard: Mask each layer with a binary mask that reflects the structure of the graph. Specifically, they construct the mask as M = I_n + A, where I_n is the n-dimensional identity matrix and A is the adjacency matrix of the graph. Figure 2 shows an example. This allows each node to see only itself and its immediate neighbors, and introduces considerable sparsity into the problem, which is particularly attractive from a computational-cost perspective.



  • BoT-Mix: Interweaves layers with masked attention (like BoT-Hard) with layers with unmasked attention.
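Building the BoT-Hard mask from an edge list is straightforward. The sketch below (the 4-node star body, with node 0 as the root, is an illustrative assumption) constructs M = I_n + A and converts it to the additive 0/-inf form that most attention implementations expect:

```python
# Sketch of the BoT-Hard attention mask M = I_n + A, converted to an
# additive mask: allowed pairs get logit offset 0, disallowed pairs -inf,
# so softmax assigns them zero attention weight.
import numpy as np

n = 4
edges = [(0, 1), (0, 2), (0, 3)]        # star graph: node 0 is the root

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0             # undirected adjacency

M = np.eye(n) + A                       # each node sees itself + neighbors
additive_mask = np.where(M > 0, 0.0, -np.inf)

# Row 1: limb node 1 may attend to the root (0) and itself (1),
# but not to sibling limbs 2 and 3.
print(additive_mask[1])
```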

Detokenizer

The features output by the Transformer encoder are fed into linear layers that project them into the actions associated with each node's limb; actions are assigned to nodes according to the proximity of the corresponding actuators to that limb. Again, these learnable linear projections are separate for each node. When BoT is used as a critic architecture in a reinforcement learning setting, the detokenizer outputs values instead of actions, which are then averaged over the body parts.
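A minimal sketch of the detokenizer, assuming (not taken from the paper's code) per-node linear heads and toy dimensions:

```python
# Sketch of the detokenizer: separate per-node linear heads map encoder
# features either to per-limb actions (actor) or to scalar values that
# are averaged over body parts (critic).
import numpy as np

rng = np.random.default_rng(1)
n_nodes, feat_dim, act_dim = 5, 16, 2

W_act = [rng.standard_normal((feat_dim, act_dim)) for _ in range(n_nodes)]
W_val = [rng.standard_normal((feat_dim, 1)) for _ in range(n_nodes)]

def detokenize_actions(features):
    """features: (n_nodes, feat_dim) -> actions: (n_nodes, act_dim)."""
    return np.stack([features[i] @ W_act[i] for i in range(n_nodes)])

def detokenize_value(features):
    """Critic head: per-node scalar values averaged into one estimate."""
    vals = np.stack([features[i] @ W_val[i] for i in range(n_nodes)])
    return float(vals.mean())

feats = rng.standard_normal((n_nodes, feat_dim))
print(detokenize_actions(feats).shape)  # -> (5, 2)
```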

Experiments

The team evaluated the performance of BoT in both imitation learning and reinforcement learning settings. They maintained the same structure as Figure 1 and only replaced the BoT encoder with various baseline architectures to determine the effect of the encoder.

The goals of these experiments are to answer the following questions:

  • Can masked attention improve the performance and generalization of imitation learning?
  • Does BoT show positive scaling trends compared to the original Transformer architecture?
  • Is BoT compatible with reinforcement learning frameworks, and what reasonable design choices can maximize performance?
  • Can BoT policies be applied to real-world robotics tasks?
  • What are the computational advantages of masked attention?

Imitation Learning Experiment

The team evaluated the imitation learning performance of the BoT architecture on the body tracking task, which is defined using the MoCapAct dataset.

The results are shown in Figure 3a, where we can see that BoT consistently outperforms the MLP and Transformer baselines. Notably, BoT’s advantage over these architectures increases further on unseen validation video clips, demonstrating that the body-aware inductive bias can lead to improved generalization.



Figure 3b shows that BoT-Hard scales well: compared with the Transformer baseline, its performance on both training and validation video clips improves as the number of trainable parameters grows. This further suggests that the embodiment-induced bias keeps BoT-Hard from overfitting the training data. More experimental examples are shown below; see the original paper for details.





Reinforcement Learning Experiment

The team evaluated the reinforcement learning performance of BoT versus a baseline using PPO on four robotic control tasks in Isaac Gym: Humanoid-Mod, Humanoid-Board, Humanoid-Hill, and A1-Walk.

Figure 5 shows the average episode returns of the evaluation rollouts during training for MLP, Transformer, and BoT (Hard and Mix). The solid line corresponds to the mean and the shaded area corresponds to the standard error over five seeds.



Results show that BoT-Mix consistently outperforms MLP and vanilla Transformer baselines in terms of sample efficiency and asymptotic performance, suggesting that incorporating biases from the robot body into the policy network architecture is useful.

Meanwhile, BoT-Hard outperforms the original Transformer on the simpler tasks (A1-Walk and Humanoid-Mod), but performs worse on the harder exploration tasks (Humanoid-Board and Humanoid-Hill). Since masked attention hinders the propagation of information from distant body parts, BoT-Hard's strong restriction on information flow may hamper the efficiency of reinforcement learning exploration.

Real-world experiments

Isaac Gym's simulated locomotion environments are often used to transfer reinforcement learning policies from simulation to the real world without real-world fine-tuning. To verify that the proposed architecture is suitable for real-world applications, the team deployed a BoT policy trained as described above on a Unitree A1 robot. As the following video shows, the new architecture can be reliably used for real-world deployment.



Computational analysis

The team also analyzed the computational cost of the new architecture, as shown in Figure 6. Here are the scalability results of the newly proposed masked attention and conventional attention on different sequence lengths (number of nodes).



As the figure shows, with 128 nodes (equivalent to a humanoid robot with dexterous arms), the new attention is up to 206% faster.

Overall, this suggests that the embodiment-derived bias in the BoT architecture not only improves the overall performance of physical agents, but also benefits from the architecture's naturally sparse masks, which can substantially reduce the training time of the learning algorithm given sufficient parallelization.
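The source of the computational advantage can be illustrated with back-of-the-envelope arithmetic: dense attention scores all n^2 node pairs, while masked attention only needs the nonzero entries of M = I + A. The chain-body graph below is an illustrative assumption, not the robot benchmarked in the paper:

```python
# Rough count of attention score pairs: dense (n^2) vs. masked
# (nonzeros of M = I + A for a chain graph: n self-edges plus
# 2*(n-1) directed neighbor edges).
n = 128                                  # e.g. a humanoid with dexterous arms
dense_pairs = n * n                      # 16384
sparse_pairs = n + 2 * (n - 1)           # 382

print(dense_pairs, sparse_pairs, round(dense_pairs / sparse_pairs, 1))
```

Real speedups depend on the attention kernel's ability to exploit this sparsity, but the gap in the number of scored pairs grows with the node count, consistent with the trend reported in Figure 6.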