State representation learning using a graph neural network in a 2D grid world

Finished: 2021-05-27

MSc assignment

A Markov decision process (MDP) is a mathematically idealized form of the reinforcement learning problem. MDPs are a classical formalization of sequential decision making and provide the mathematical framework for modeling decision-making strategies. The goal of the reinforcement learning problem is to find a policy that solves the problem at hand in some optimal manner, i.e. by maximizing the expected sum of rewards. The optimal policy is the solution to the Bellman equation and can be found with dynamic programming by evaluating the value function in every state. In most practical problems, however, this is not feasible because the state space is simply too large for dynamic programming. A solution is to estimate the value function using function approximation: the value function is represented as a weighted sum of a set of features (called basis functions). The basis functions can be handcrafted or automatically generated, and the weights are typically learned via parameter estimation methods. The main challenge is to find a low-dimensional set of basis functions that still yields a meaningful representation of the MDP.
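
As a rough illustration of this kind of linear value function approximation, V(s) ≈ w · φ(s), the sketch below fits weights over a few handcrafted basis functions in a toy 2D grid world. The grid size, basis functions, and value targets are made-up assumptions for the sketch, not part of the assignment.

```python
import numpy as np

# Hypothetical 5x5 grid world: states are (row, col) cells, goal in a corner.
GRID = 5
states = [(r, c) for r in range(GRID) for c in range(GRID)]
goal = (GRID - 1, GRID - 1)

def basis(state):
    """Handcrafted basis functions phi(s): bias, normalized coordinates,
    and negative Manhattan distance to the goal."""
    r, c = state
    dist = abs(goal[0] - r) + abs(goal[1] - c)
    return np.array([1.0, r / (GRID - 1), c / (GRID - 1), -dist / (2 * (GRID - 1))])

def v_hat(state, w):
    """Linear value function approximation: V(s) ~ w . phi(s)."""
    return w @ basis(state)

# Stand-in value targets for each state, e.g. Monte Carlo returns collected
# under some policy (here simply gamma to the power of the goal distance).
gamma = 0.95
targets = np.array([gamma ** (abs(goal[0] - r) + abs(goal[1] - c)) for r, c in states])

# Least-squares fit of the weights to the targets.
Phi = np.vstack([basis(s) for s in states])
w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
print(v_hat((0, 0), w), v_hat(goal, w))
```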

New insights led to the approach of modeling the MDP as a graph. The state-space topology of an MDP can intuitively be represented as a graph, with states as nodes and transition probabilities as edge weights. In most practical problems the MDP's transition probabilities are unknown, but the graph can be constructed from samples. Once the graph is constructed, a lower-dimensional representation can be learned using node embedding methods. The goal is to optimize this mapping such that geometric relationships in the learned space resemble the structure of the original graph. After optimizing the embedding, the learned representation can be used as feature input for downstream machine learning methods that approximate the value function. The key difference with previous work is that the structural information is captured in the graph and used to learn embeddings, whereas previously it was captured using hand-engineered features.
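
Below is a minimal sketch of one way such a state graph could be built from sampled transitions and then embedded, assuming a Laplacian eigenmap style spectral embedding. The transition samples and the choice of embedding method are illustrative assumptions, not the prescribed approach.

```python
import networkx as nx
import numpy as np

# Hypothetical transition samples (s, a, s') gathered by exploring the grid
# world; only the state-to-state structure is kept for the graph.
samples = [((0, 0), "right", (0, 1)),
           ((0, 1), "down",  (1, 1)),
           ((1, 1), "right", (1, 2)),
           ((1, 2), "up",    (0, 2)),
           ((0, 2), "left",  (0, 1))]

# Build an undirected state graph from the samples.
G = nx.Graph()
for s, _, s_next in samples:
    G.add_edge(s, s_next)

# Spectral node embedding: the eigenvectors of the normalized graph
# Laplacian with the smallest nonzero eigenvalues serve as low-dimensional
# state features.
nodes = list(G.nodes())
L = nx.normalized_laplacian_matrix(G, nodelist=nodes).toarray()
eigvals, eigvecs = np.linalg.eigh(L)
embedding_dim = 2
embeddings = {n: eigvecs[i, 1:1 + embedding_dim] for i, n in enumerate(nodes)}

# These embeddings can then feed a downstream value function approximator.
print(embeddings[(0, 0)])
```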

With the recent development of Graph Neural Networks (GNNs), a new door has opened for tackling the reinforcement learning problem. The relational structure of a GNN might improve the embedding methods and thereby better approximate the value function. Therefore, in my thesis, I will be working on the subject of representation learning on graphs. The goal of this work is to advance the state of the art in representation learning using graphs and, more specifically, to make this method more practical in a robotics-reinforcement learning framework.
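
For concreteness, the sketch below shows a single graph convolution layer in the style of Kipf and Welling's GCN, propagating node features over the state graph. This is only one possible GNN variant; the toy adjacency matrix, features, and weights are made up for illustration.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W), where
    A is the state-graph adjacency matrix, H the node features,
    and W the learnable weight matrix."""
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Toy example: 4 states in a chain, 3 input features, 2-dimensional embedding.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))   # initial node features (e.g. one-hot or coordinates)
W = rng.normal(size=(3, 2))   # weights, trained end to end in practice
print(gcn_layer(A, H, W))
```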