Unifying State Representation Learning with Intrinsic Motivations in Reinforcement Learning

Robots are increasingly expected to work in unknown or unpredictable environments, such as navigating roads or picking objects from a random pile. Model-based control methods do not work well in these environments because no accurate model is available. Learning-based control methods, such as reinforcement learning, do not need accurate models to find good control policies in unknown environments. Reinforcement learning, however, suffers from the curse of dimensionality: the computational effort required grows exponentially with the size of the robot's observations. This problem can be mitigated by extracting the important features of the observation into a low-dimensional synthetic state representation. Currently, synthetic state representations are trained on a history of observations and actions gathered with a random action policy. A random action policy does not use the information it has already gathered to adjust which states it samples. This suggests that the training of synthetic state representations could be improved by choosing the actions used to collect the training samples more deliberately.
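
As a concrete illustration, the sketch below shows one common way such a representation can be learned from recorded transitions: an encoder compresses each observation into a small latent state, and a forward model is trained to predict the next latent state from the current state and action. The network sizes, dimensions, and loss shown here are illustrative assumptions, not the exact architecture used in this work.

    # Minimal sketch (PyTorch) of state representation learning from a history
    # of (observation, action, next observation) transitions. All sizes and the
    # forward-model loss are assumptions for illustration only.
    import torch
    import torch.nn as nn

    OBS_DIM, ACT_DIM, STATE_DIM = 64 * 64, 2, 8  # assumed observation/action/state sizes

    class Encoder(nn.Module):
        """Compress a high-dimensional observation into a low-dimensional state."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(OBS_DIM, 256), nn.ReLU(),
                nn.Linear(256, STATE_DIM),
            )

        def forward(self, obs):
            return self.net(obs)

    class ForwardModel(nn.Module):
        """Predict the next latent state from the current latent state and action."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(STATE_DIM + ACT_DIM, 128), nn.ReLU(),
                nn.Linear(128, STATE_DIM),
            )

        def forward(self, state, action):
            return self.net(torch.cat([state, action], dim=-1))

    encoder, forward_model = Encoder(), ForwardModel()
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(forward_model.parameters()), lr=1e-3
    )

    def srl_update(obs, action, next_obs):
        """One training step on a batch of transitions collected by some policy."""
        state, next_state = encoder(obs), encoder(next_obs)
        pred_next = forward_model(state, action)
        # In practice this loss is usually combined with e.g. a reconstruction or
        # contrastive term so the encoder cannot collapse to a constant state.
        loss = ((pred_next - next_state) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The quality of the learned representation then depends on which transitions the data-collecting policy visits, which is exactly the choice studied here.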

We trained state representations for two environments: one with simple, consistent visual features and one with complex distractor features. For each environment, we tested four different training policies: a random action policy, entropy maximization, prediction error maximization, and uniform sampling.
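
Of these, prediction error maximization is an intrinsic motivation: the agent is rewarded for visiting transitions that its current forward model predicts poorly. A minimal sketch of such an intrinsic reward, reusing the assumed encoder and forward model from the previous example:

    import torch

    def intrinsic_reward(encoder, forward_model, obs, action, next_obs):
        # The reward is the forward model's prediction error in latent space,
        # so poorly modelled transitions become attractive to the agent.
        with torch.no_grad():
            state, next_state = encoder(obs), encoder(next_obs)
            pred_next = forward_model(state, action)
            return ((pred_next - next_state) ** 2).mean().item()

An exploration policy can then be trained to maximize this intrinsic reward, steering data collection toward regions the state representation does not yet model well; entropy maximization instead rewards spreading the visited states as evenly as possible.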

We found that different sampling methods can lead to different sampling distributions, depending on the training parameters and the environment. Uniform coverage is important in complex environments, where distractor features in one part of the environment do not generalize to other areas. It matters less for learning a good structure when the features of the environment are consistent enough that state representation learning (SRL) can generalize from one area of the environment to another. Finally, the relationship between the structure of the state representation and the performance of the RL policy is complex. In the simple environment, better structural scores appear to improve RL performance; in the complex environment, the opposite holds, where the clustering of states caused by the distractor features may disrupt policy learning.

Sampling methods that lead to a more uniform sampling distribution may improve the structural quality of the learned state representation. However, this only holds in complex environments where generalization is impossible. Finally, what constitutes a good structure for a synthetic state representation is still unknown, because an "improved" structure does not necessarily lead to higher RL performance. More research is therefore needed into how the structure of state representations can facilitate RL performance, and how sampling methods can support the learning of those structures.