Part 6: Partial Observability and Deep Recurrent Q-Networks
In this installment of my Simple RL series, I want to introduce the concept of Partial Observability and demonstrate how to design neural agents which can successfully deal with it.
For us humans, having access to a limited and changing world is a universal aspect of our shared experience. Despite our partial access to the world, we are able to solve all sorts of challenging problems in the course of going about our daily lives.
The Problem of Partial Observability
In the simple gridworld environments used earlier in this series, the entire world is visible at any moment (and nothing moves aside from the agent), so a single frame of the environment gives the agent everything it needs to know in order to maximize its reward. Environments structured so that a given state conveys everything the agent needs to act optimally are called Markov Decision Processes (MDPs).
Can you see what is behind you? This limited perspective on the visual world is almost always the default for humans and other animals. Even if we were to have 360 degree vision, we may still not know what is on the other side of a wall just beyond us. Information outside our view is often essential to making decisions regarding the world.
Environments which present themselves in a limited way to the agent are referred to as Partially Observable Markov Decision Processes (POMDPs). While they are trickier to solve than their fully observable counterparts, understanding them is essential to solving most realistic tasks.
Making sense of a limited, changing world
How can we build a neural agent which still functions well in a partially observable world? The key is to give the agent a capacity for temporal integration of observations.
The intuition behind this is simple: if information at a single moment isn’t enough to make a good decision, then enough varying information over time probably is.
Within the context of Reinforcement Learning, there are a number of possible ways to accomplish this temporal integration. The solution taken by DeepMind in their original paper on Deep Q-Networks was to stack the frames from the Atari simulator.
Instead of feeding the network a single frame at a time, they used an external frame buffer which kept the last four frames of the game in memory and fed this to the neural network.
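The frame-stacking idea can be sketched in a few lines of numpy. This is my own illustrative version, not DeepMind's actual code; the names and sizes (84×84 grayscale frames, a stack of 4) follow the conventions of their paper.

```python
# Illustrative sketch of frame stacking: a deque keeps the last 4 frames,
# and the stacked block is what the network actually sees as its "state".
from collections import deque
import numpy as np

FRAME_SHAPE = (84, 84)   # a preprocessed grayscale Atari frame
STACK_SIZE = 4

buffer = deque(maxlen=STACK_SIZE)  # maxlen evicts the oldest frame automatically

def observe(frame):
    """Add a new frame and return the stacked state of shape (84, 84, 4)."""
    buffer.append(frame)
    while len(buffer) < STACK_SIZE:  # pad with the first frame at episode start
        buffer.append(frame)
    return np.stack(buffer, axis=-1)

state = observe(np.zeros(FRAME_SHAPE))
print(state.shape)  # (84, 84, 4)
```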
This approach worked well for the Atari games, but it isn’t ideal for a number of reasons.
The first is that it isn’t necessarily biologically plausible. When light hits our retinas, it does so at a single moment; there is no way for light to be stored up and passed to the eye all at once.
Secondly, by using blocks of four frames as its state, the experience buffer must store four times as many frames per state, making the training process require a larger amount of potentially unnecessary memory.
Lastly, we may simply need to keep things in mind that happened much earlier than would be feasible to capture with stacking frames. Sometimes an event hundreds of frames earlier might be essential to deciding what to do at the current moment. We need a way for our agent to keep events in mind more robustly.
Recurrent Neural Networks
You may have heard of recurrent neural networks, and their capacity to learn temporal dependencies.
They have been used to popular effect for text generation, with groups training RNNs to reproduce everything from Barack Obama speeches to freeform poetry.
Andrej Karpathy has a great post outlining RNNs and their capacities, which I highly recommend.
By utilizing a recurrent block in our network, we can pass the agent single frames of the environment, and the network will be able to change its output depending on the temporal pattern of observations it receives. It does this by maintaining a hidden state that it computes at every time-step. The recurrent block can feed the hidden state back into itself, thus acting as an augmentation which tells the network what has come before.
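The recurrence described above can be shown in a minimal numpy sketch. This is a toy example of my own, not the agent's code: the hidden state h is re-fed into the cell at every step, so the output depends on the whole pattern of observations, not just the latest one.

```python
# Minimal recurrent cell: h_t = tanh(W_x x_t + W_h h_{t-1}).
import numpy as np

rng = np.random.RandomState(0)
obs_dim, hidden_dim = 3, 5
W_x = rng.randn(hidden_dim, obs_dim) * 0.1     # input weights
W_h = rng.randn(hidden_dim, hidden_dim) * 0.1  # recurrent weights

def step(h, x):
    # The new hidden state mixes the current observation with the previous state.
    return np.tanh(W_x @ x + W_h @ h)

h = np.zeros(hidden_dim)
for x in [np.ones(obs_dim), np.zeros(obs_dim), np.ones(obs_dim)]:
    h = step(h, x)
# h now encodes the temporal pattern of all three observations.
```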
The class of agents which utilize this recurrent network are referred to as Deep Recurrent Q-Networks (DRQN).
Implementing in Tensorflow
In order to implement a Deep Recurrent Q-Network (DRQN) architecture in Tensorflow, we need to make a few modifications to the DQN described in Part 4.
The first change is to the agent itself. We will insert an LSTM recurrent cell between the output of the last convolutional layer and the input into the split between the Value and Advantage streams.
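At a shape level, the placement of the recurrent cell can be sketched as follows. This is my own simplified illustration in numpy (a real LSTM also carries a cell state and gating; the dimensions here just echo typical DQN sizes):

```python
# Shape-level sketch: conv features -> recurrent cell -> Value/Advantage split.
import numpy as np

rng = np.random.RandomState(0)
conv_dim, hidden_dim, n_actions = 512, 512, 4

# Stand-ins for learned weights.
W_rnn = rng.randn(hidden_dim, conv_dim + hidden_dim) * 0.01
W_value = rng.randn(1, hidden_dim) * 0.01
W_adv = rng.randn(n_actions, hidden_dim) * 0.01

def drqn_step(conv_features, h):
    # Simplified recurrent update in place of a full LSTM.
    h = np.tanh(W_rnn @ np.concatenate([conv_features, h]))
    value = W_value @ h                       # scalar state value
    advantage = W_adv @ h                     # per-action advantages
    q = value + advantage - advantage.mean()  # dueling combination
    return q, h

h = np.zeros(hidden_dim)
q, h = drqn_step(rng.randn(conv_dim), h)
print(q.shape)  # (4,)
```

The hidden state h is carried forward between calls, which is exactly what lets the Q-values depend on earlier frames.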
The second main change needed will be to adjust the way our experience buffer stores memories. Since we want to train our network to understand temporal dependencies, we can’t use random batches of experience. Instead, we will draw traces of experience of a fixed length, sampled from within randomly chosen episodes.
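A minimal sketch of such an episodic buffer might look like this (class and method names are my own, not from the accompanying repository):

```python
# Episodic experience buffer: store whole episodes, then sample contiguous
# traces of a fixed length so temporal order is preserved within each trace.
import random

class EpisodicBuffer:
    def __init__(self, buffer_size=1000):
        self.episodes = []
        self.buffer_size = buffer_size

    def add(self, episode):
        """episode: an ordered list of (s, a, r, s_next, done) transitions."""
        if len(self.episodes) >= self.buffer_size:
            self.episodes.pop(0)  # drop the oldest episode
        self.episodes.append(episode)

    def sample(self, batch_size, trace_length):
        traces = []
        for ep in random.sample(self.episodes, batch_size):
            start = random.randint(0, len(ep) - trace_length)
            traces.append(ep[start:start + trace_length])
        return traces
```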
Finally, we will be utilizing a technique developed by a group at Carnegie Mellon who recently used DRQN to train a neural network to play the first person shooter game Doom. Instead of sending all the gradients backwards when training their agent, they sent only the last half of the gradients for a given trace.
The Limited Gridworld
Implementing the network itself
To explain very roughly, the difference is that static_rnn leaves preprocessing of the input to the user and builds a fixed graph, whereas dynamic_rnn handles all of that preprocessing internally and builds the graph based on it.
# In order to only propagate accurate gradients through the network, we will
# mask the first half of the losses for each trace, as per Lample & Chaplot 2016.
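The masking itself can be illustrated with a small numpy sketch (my own illustration of the idea from Lample & Chaplot 2016, not the article's TensorFlow code): losses from the first half of each trace, where the LSTM state has seen too little history to be accurate, are simply zeroed out.

```python
# Mask the first half of each trace's per-step losses; only the second half,
# where the recurrent state is well-formed, contributes to the gradient.
import numpy as np

batch_size, trace_length = 4, 8
td_errors = np.ones((batch_size, trace_length))  # stand-in per-step losses

mask = np.concatenate([np.zeros(trace_length // 2),
                       np.ones(trace_length // 2)])
masked_loss = (td_errors * mask).sum() / mask.sum() / batch_size
```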
Training the network
Testing the network