Part 4: Deep Q-Networks and Beyond
- 1 Overview
- 2 Getting from Q-Network to Deep Q-Network
- 3 Addition 1: Convolutional Layers
- 4 Addition 2: Experience Replay
- 5 Addition 3: Separate Target Network
- 6 Going Beyond DQN
- 7 Double DQN
- 8 Dueling DQN
- 9 Putting it all together
- 10 Code Examples
It will be built upon the simple one-layer Q-network we created in Part 0, …
In order to transform an ordinary Q-Network into a DQN we will be making the following improvements:
- Going from a single-layer network to a multi-layer convolutional network.
- Implementing Experience Replay, which will allow our network to train itself using stored memories from its experience.
- Utilizing a second “target” network, which we will use to compute target Q-values during our updates.
I will discuss two simple additional improvements to the DQN architecture, Double DQN and Dueling DQN, that allow for improved performance, stability, and faster training time.
Getting from Q-Network to Deep Q-Network
Addition 1: Convolutional Layers
Addition 2: Experience Replay
By drawing experiences at random, we prevent the network from learning only from what it is immediately doing in the environment, and allow it to learn from a more varied array of past experiences.
The Experience Replay buffer stores a fixed number of recent memories, and as new ones come in, old ones are removed. When the time comes to train, we simply draw a uniform batch of random memories from the buffer, and train our network with them.
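A minimal sketch of such a buffer might look like the following. This is not the article's actual code; the class name `ExperienceBuffer` and the tuple layout `(state, action, reward, next_state, done)` are my own assumptions.

```python
import random
from collections import deque

class ExperienceBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, buffer_size=50000):
        # deque with maxlen drops the oldest memory automatically when full.
        self.buffer = deque(maxlen=buffer_size)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Uniform random draw (without replacement) from stored memories.
        return random.sample(self.buffer, batch_size)
```

During training you would call `sample(batch_size)` each step and fit the network on that batch rather than on the most recent transition alone.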
Addition 3: Separate Target Network
The third major addition to the DQN that makes it unique is the utilization of a second network during the training procedure. This second network is used to generate the target-Q values that will be used to compute the loss for every action during training.
The risk of using a single network is that its Q-value estimates shift at every training step; if those same shifting estimates also serve as the targets, the network is chasing a moving goal, and the estimates can spiral out of control through feedback between targets and predictions. In order to mitigate that risk, the target network's weights are kept fixed, and only periodically or slowly updated toward the primary Q-network's values.
Instead of updating the target network periodically and all at once, we will be updating it frequently, but slowly. This technique was introduced in another DeepMind paper earlier this year, where they found that it stabilized the training process.
The "another DeepMind paper" referred to here is this one.
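The slow update itself is tiny: every training step, each target parameter moves a small fraction τ of the way toward the corresponding primary parameter. A sketch of this soft update, assuming the weights are available as a list of arrays (the function name and the τ value are illustrative, not from the article):

```python
import numpy as np

def soft_update(primary, target, tau=0.001):
    """Blend each target parameter a fraction tau toward its primary counterpart."""
    # With tau small, the target network trails the primary network smoothly
    # instead of jumping to it all at once.
    return [tau * p + (1.0 - tau) * t for p, t in zip(primary, target)]
```

Calling this once per training step replaces the periodic wholesale copy of the original DQN.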
Going Beyond DQN
But the world moves fast, and a number of improvements above and beyond the DQN architecture described by DeepMind have allowed for even greater performance and stability.
I will provide a description and some code for two of them: Double DQN, and Dueling DQN. Both are simple to implement, and by combining both techniques, we can achieve better performance with faster training times.
The main intuition behind Double DQN is that the regular DQN often overestimates the Q-values of the potential actions to take in a given state.
In order to correct for this, the authors of the DDQN paper propose a simple trick: instead of taking the max over Q-values when computing the target-Q value for our training step, we use our primary network to choose an action, and our target network to generate the target Q-value for that action.
By decoupling the action choice from the target Q-value generation, we are able to substantially reduce the overestimation, and train faster and more reliably.
Q-Target = r + γ Q(s′, argmax_a Q(s′, a, θ), θ′)

where θ are the primary network's weights and θ′ the target network's.
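In code, the trick amounts to an argmax over the primary network's Q-values at s′ followed by a lookup into the target network's Q-values. A hedged NumPy sketch (function and argument names are my own; `gamma` is the discount γ, and `dones` masks out bootstrapping past terminal states):

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_primary_next, q_target_next, gamma=0.99):
    """Q-Target = r + gamma * Q_target(s', argmax_a Q_primary(s', a))."""
    # Primary network chooses the action...
    best_actions = np.argmax(q_primary_next, axis=1)
    # ...target network evaluates it.
    q_eval = q_target_next[np.arange(len(rewards)), best_actions]
    # No future reward is bootstrapped when the episode has ended.
    return rewards + gamma * q_eval * (1.0 - dones)
```

Here `q_primary_next` and `q_target_next` are the two networks' Q-value matrices (batch × actions) for the next states in the sampled batch.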
The Q-values that we have been discussing so far correspond to how good it is to take a certain action given a certain state, written Q(s,a). This quantity can actually be decomposed into two more fundamental notions of value. The first is the value function V(s), which says simply how good it is to be in a given state. The second is the advantage function A(a), which tells how much better taking a certain action is compared to the others. We can then think of Q as the combination of V and A.
Q(s,a) = V(s) + A(a)
The goal of Dueling DQN is to have a network that separately computes the advantage and value functions, and combines them back into a single Q-function only at the final layer.
The key to realizing the benefit is to appreciate that our reinforcement learning agent may not need to care about both value and advantage at any given time.
We can achieve more robust estimates of state value by decoupling it from the necessity of being attached to specific actions.
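The combining step at the final layer can be sketched as follows. Subtracting the mean advantage, as proposed in the Dueling DQN paper, keeps the two streams identifiable: without it, adding a constant to A and subtracting it from V would leave Q unchanged. The function name and shapes here are my own assumptions, not the article's code.

```python
import numpy as np

def dueling_q(value, advantage):
    """Combine a value stream (batch x 1) and an advantage stream (batch x actions).

    Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a))
    """
    # Centering the advantages forces them to mean zero per state,
    # so V alone carries the state's overall value.
    return value + (advantage - advantage.mean(axis=1, keepdims=True))
```

In a real network, `value` and `advantage` would be the outputs of two separate fully connected streams fed by a shared convolutional trunk.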
Putting it all together
For educational purposes, I have built a simple game environment which our DQN learns to master in a couple of hours on a moderately powerful machine (I am using a GTX 970).
In the environment the agent controls a blue square, and the goal is to navigate to the green squares (reward +1) while avoiding the red squares (reward -1).
At the start of each episode all squares are randomly placed within a 5×5 grid-world.
Load the game environment
Above is an example of a starting environment in our simple game.
Implementing the network itself
Training the network
Saving and restoring variables
```python
trainBatch = myBuffer.sample(batch_size)  # Get a random batch of experiences.
```