Graph Policy Gradients for Large Scale Robot Control

ARBAAZ KHAN


Philadelphia, Pennsylvania


In this paper, the problem of learning policies to control a large number of homogeneous robots is considered. To this end, we propose a new algorithm we call Graph Policy Gradients (GPG) that exploits the underlying graph symmetry among the robots.

Project status: Published/In Market

Robotics, Artificial Intelligence

Intel Technologies
AI DevCloud / Xeon


Overview / Usage

In the recent past, deep learning has proved to be an extremely valuable tool for robotics. Harnessing the power of deep neural networks has emerged as a successful approach to designing policies that map sensor inputs to control outputs for complex tasks. These include, but are not limited to, learning to play video games, learning complex control policies for robot tasks, and learning to plan with only sensor information. While these works are impressive, they often concern themselves with controlling a single entity.

However, in many real-world applications, especially in fields like robotics, there is a need to control multiple robots that interact with each other in cooperative or competitive settings. Examples include warehouse management with teams of robots, multi-robot furniture assembly, and concurrent control and communication for teams of robots. In such scenarios, as the number of robots increases, the dimensionality of the input space and the control space both increase, making it much harder to learn meaningful policies. This paper tackles the problem of learning individual control policies by exploiting the underlying graph structure among the robots. We start with the hypothesis that the difficulty of learning scalable policies for multiple robots can be attributed to two key issues: dimensionality and partial information. Consider an environment with N robots (this work uses bold font when referring to collections of items, vectors, and matrices). Each robot receives only partial observations of the environment. In order for a robot to learn a meaningful control policy, it must interact with some subset n of all N agents. Finding the right subset of neighbors to learn from is in itself a challenging problem. Further, to ensure that the method scales as the number of robots increases, one needs to ensure that the cardinality of the subset of neighbors, |n|, remains fixed or grows very slowly. The sketch below illustrates this idea.
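As a concrete illustration (a minimal sketch, not code from the paper; the function name neighbor_sets, the k-nearest-neighbor rule, and the toy numbers are assumptions for the example), the snippet below shows how each robot's local view can be limited to a fixed-size set of nearest neighbors, so the per-robot input does not grow with N.

```python
import numpy as np

def neighbor_sets(positions, k=3):
    """For each robot, return the indices of its k nearest neighbors.

    positions: (N, d) array of robot positions in Euclidean space.
    The neighborhood size stays fixed at k regardless of N, which is
    what keeps the per-robot input dimension from growing.
    """
    # Pairwise Euclidean distances between all robots.
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)          # a robot is not its own neighbor
    # Indices of the k closest robots for each robot.
    return np.argsort(dists, axis=1)[:, :k]

# Example: 100 robots in the plane; each robot attends to only 3 neighbors.
positions = np.random.rand(100, 2)
nbrs = neighbor_sets(positions, k=3)
print(nbrs.shape)  # (100, 3)
```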

Methodology / Approach

To solve the problem of large-scale multi-robot control, we draw inspiration from convolutional neural networks (CNNs). CNNs consist of sequentially composed layers, where each layer comprises banks of linear time-invariant filters to extract local features, along with pooling operations to reduce dimensionality. Convolutions regularize the linear transform to exploit the underlying structure of the data and hence constrain the search space for the linear transform that minimizes the cost function. However, CNNs cannot be directly applied to irregular data elements that have arbitrary pairwise relationships defined by an underlying graph.

To overcome this limitation, a new architecture called graph convolutional networks (GCNs) has emerged. Similar to CNNs, GCNs consist of sequentially composed layers that regularize the linear transform in each layer to be a graph convolution with a bank of graph filters, where the filter weights are learned by minimizing some cost function. When controlling a large swarm of robots operating in a Euclidean space R^n, the underlying graph can be defined as G = (V, E), where V is the set of nodes and E is the set of edges. For example, we define each robot to be a node and add an edge between two robots if they are less than some small epsilon distance apart. This graph acts as a support for the data vector x = [x1, . . . , xN]^T, where xn is the state representation of robot n.

Our GCN exploits this graph G: at each node it aggregates information from the node's neighbors, and this information is propagated forward to the next layer through a nonlinear transformation. The output of the final layer is Π = [π1, . . . , πN], where π1, . . . , πN are independent policies for the robots. As in standard reinforcement learning, we execute these policies in the environment, collect a centralized reward, and use policy gradients to update the weights of the policy network. We call this algorithm Graph Policy Gradients (GPG).
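To show how these pieces fit together, here is a minimal PyTorch sketch (not the released implementation, which uses the Deep Graph Library): it builds the epsilon-distance graph, applies two graph-convolution layers whose weights are shared by all robots, samples one independent action per robot, and performs a REINFORCE-style policy-gradient update against a centralized reward. The toy dimensions, the random stand-in reward, and the row-normalization of the adjacency matrix are assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn as nn

def epsilon_graph(positions, eps=0.5):
    """Row-normalized adjacency: edge between robots closer than eps, plus self-loops."""
    dists = torch.cdist(positions, positions)
    adj = (dists < eps).float()
    adj.fill_diagonal_(1.0)
    return adj / adj.sum(dim=1, keepdim=True)

class GraphConvPolicy(nn.Module):
    """Two graph-convolution layers; the weights are shared across all robots,
    so one network outputs an independent policy pi_n for every robot n."""
    def __init__(self, in_dim, hidden_dim, n_actions):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, adj, x):
        h = torch.relu(self.w1(adj @ x))      # aggregate neighbor features, then transform
        return torch.distributions.Categorical(logits=self.w2(adj @ h))

# Toy rollout step: 50 robots, 4-dimensional states, 5 discrete actions.
N, in_dim, n_actions = 50, 4, 5
positions = torch.rand(N, 2)
states = torch.rand(N, in_dim)

policy = GraphConvPolicy(in_dim, hidden_dim=32, n_actions=n_actions)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

adj = epsilon_graph(positions, eps=0.3)
dist = policy(adj, states)
actions = dist.sample()                       # one independent action per robot
reward = torch.rand(())                       # stand-in for the centralized team reward

# REINFORCE-style surrogate: scale the joint log-probability by the reward.
loss = -dist.log_prob(actions).sum() * reward
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice the reward would come from executing the sampled actions in the multi-robot environment over an episode rather than from a random placeholder, but the gradient computation has the same shape.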

The full paper can be found at https://arxiv.org/pdf/1907.03822.pdf

Technologies Used

Intel DevCloud, PyTorch, Deep Graph Library, AirSim.
