[Struggling Through Reinforcement Learning] Tutorial Part 4-4

This time it is Part 4: Deep Q-Networks and Beyond.
So this is the famous DQN.

Overview

About DQN

It will be built upon the simple one layer Q-network we created in Part 0, …

In order to transform an ordinary Q-Network into a DQN we will be making the following improvements:

  1. Going from a single-layer network to a multi-layer convolutional network.
  2. Implementing Experience Replay, which will allow our network to train itself using stored memories from its experience.
  3. Utilizing a second “target” network, which we will use to compute target Q-values during our updates.

DQN is the network famously used to play the Atari games.

I will discuss two simple additional improvements to the DQN architecture, Double DQN and Dueling DQN, that allow for improved performance, stability, and faster training time.

I have heard of these as well.
Double DQN, Dueling DQN, and so on…

Getting from Q-Network to Deep Q-Network

DQN

Addition 1: Convolutional Layers

These are the layers that read the game screen (the raw pixels).
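The tutorial builds this stack with TF-Slim; the following is only a rough tf.keras sketch of the same four-convolution trunk (the 84×84×3 input and the filter shapes follow the DeepMind-style network used in the tutorial, while the function name is mine).

```python
# Rough sketch only: a DQN-style convolutional trunk for 84x84x3 game frames.
import tensorflow as tf

def build_conv_trunk(frames):
    """frames: a [batch, 84, 84, 3] float tensor of game screens."""
    x = tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu", padding="valid")(frames)
    x = tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu", padding="valid")(x)
    x = tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu", padding="valid")(x)
    x = tf.keras.layers.Conv2D(512, 7, strides=1, activation="relu", padding="valid")(x)  # -> [batch, 1, 1, 512]
    return tf.keras.layers.Flatten()(x)  # -> [batch, 512] feature vector
```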

Addition 2: Experience Replay

This is the Experience Replay part (really a memory buffer rather than a network layer).

By keeping the experiences we draw random, we prevent the network from only learning about what it is immediately doing in the environment, and allow it to learn from a more varied array of past experiences.

So this exists so the network can also learn more from past experiences?

The Experience Replay buffer stores a fixed number of recent memories, and as new ones come in, old ones are removed. When the time comes to train, we simply draw a uniform batch of random memories from the buffer, and train our network with them.
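A minimal sketch of such a buffer (my own simplification, not the tutorial's experience_buffer class): fixed capacity, oldest memories dropped first, uniform random sampling for training batches.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size buffer; old memories fall off as new ones arrive."""
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)  # uniform, without replacement
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones
```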

Addition 3: Separate Target Network

The third major addition to the DQN that makes it unique is the utilization of a second network during the training procedure. This second network is used to generate the target-Q values that will be used to compute the loss for every action during training.

In order to mitigate that risk, the target network’s weights are fixed, and only periodically or slowly updated to the primary Q-network’s values.

Instead of updating the target network periodically and all at once, we will be updating it frequently, but slowly. This technique was introduced in another DeepMind paper earlier this year, where they found that it stabilized the training process.

The "another DeepMind paper" mentioned here is this one.
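In code, the tutorial implements this slow update with updateTargetGraph/updateTarget, which build TensorFlow assign ops. Conceptually it is just a small blend per weight; here is a framework-agnostic sketch (the mixing rate tau = 0.001 is the value I remember from the tutorial, so treat it as an assumption).

```python
def soft_update(primary_weights, target_weights, tau=0.001):
    """Move each target-network weight a small step toward the primary network."""
    return [tau * p + (1.0 - tau) * t
            for p, t in zip(primary_weights, target_weights)]
```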

Going Beyond DQN

But the world moves fast, and a number of improvements above and beyond the DQN architecture described by DeepMind have allowed for even greater performance and stability.

I will provide a description and some code for two of them: Double DQN, and Dueling DQN. Both are simple to implement, and by combining both techniques, we can achieve better performance with faster training times.

Double DQN

The main intuition behind Double DQN is that the regular DQN often overestimates the Q-values of the potential actions to take in a given state.

In order to correct for this, the authors of the DDQN paper propose a simple trick: instead of taking the max over Q-values when computing the target-Q value for our training step, we use our primary network to choose an action, and our target network to generate the target Q-value for that action.

By decoupling the action choice from the target Q-value generation, we are able to substantially reduce the overestimation, and train faster and more reliably.

By decoupling action selection from target Q-value generation, we avoid overestimating the Q-values.
The new target Q-value equation is below.

Q-Target = r + γ Q(s', argmax_a Q(s', a, θ), θ')
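As a sanity check, here is a NumPy sketch of that target: the primary network (θ) picks the action for s', and the target network (θ') supplies that action's value. Function and argument names are mine, not the tutorial's.

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_primary_next, q_target_next, gamma=0.99):
    """rewards, dones: [batch] arrays; q_*_next: [batch, num_actions] Q-values for s'."""
    best_actions = np.argmax(q_primary_next, axis=1)                  # choose with the primary net
    chosen_q = q_target_next[np.arange(len(rewards)), best_actions]   # evaluate with the target net
    return rewards + gamma * chosen_q * (1.0 - dones)                 # no bootstrap at episode end
```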

Dueling DQN

Dueling-DQN

The Q-values that we have been discussing so far correspond to how good it is to take a certain action given a certain state. This can be written as Q(s,a). The Q-value for an action given a state can actually be decomposed into two more fundamental notions of value. The first is the value function V(s), which says simply how good it is to be in any given state. The second is the advantage function A(a), which tells how much better taking a certain action would be compared to the others. We can then think of Q as being the combination of V and A.

Q(s,a) = V(s) + A(a)

The goal of Dueling DQN is to have a network that separately computes the advantage and value functions, and combines them back into a single Q-function only at the final layer.

The key to realizing the benefit is to appreciate that our reinforcement learning agent may not need to care about both value and advantage at any given time.

We can achieve more robust estimates of state value by decoupling it from the necessity of being attached to specific actions.
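A simplified sketch of the dueling head (the tutorial splits the final convolutional features into two streams with tf.split; here I just attach two dense heads to shared features). Subtracting the mean advantage, as the dueling paper and the tutorial's code do, keeps V and A identifiable.

```python
import tensorflow as tf

def dueling_head(features, num_actions):
    """features: [batch, d] output of the convolutional trunk."""
    value = tf.keras.layers.Dense(1)(features)                # V(s): [batch, 1]
    advantage = tf.keras.layers.Dense(num_actions)(features)  # A(s,a): [batch, num_actions]
    # Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a))
    return value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))
```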

Putting it all together

For educational purposes, I have built a simple game environment which our DQN learns to master in a couple hours on a moderately powerful machine (I am using a GTX970).

I am thinking about buying a GPU myself, so it is worth remembering that the author uses a GTX 970.

In the environment the agent controls a blue square, and the goal is to navigate to the green squares (reward +1) while avoiding the red squares (reward -1).
At the start of each episode all squares are randomly placed within a 5×5 grid-world.

Code examples

Load the game environment

Above is an example of a starting environment in our simple game.

So he built the environment himself…
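For intuition, here is a toy stand-in for that environment, not the tutorial's actual gridworld.py: the state is just the agent's cell index instead of a rendered image, and the object counts are arbitrary.

```python
import random

class TinyGridWorld:
    """5x5 grid: reach goal cells (+1), avoid fire cells (-1); layout is random each episode."""
    def __init__(self, size=5):
        self.size = size
        self.reset()

    def reset(self):
        cells = random.sample(range(self.size * self.size), 5)
        self.agent = cells[0]
        self.goals = set(cells[1:3])   # +1 reward
        self.fires = set(cells[3:5])   # -1 reward
        return self.agent

    def step(self, action):
        # 0 = up, 1 = down, 2 = left, 3 = right
        r, c = divmod(self.agent, self.size)
        if action == 0:   r = max(r - 1, 0)
        elif action == 1: r = min(r + 1, self.size - 1)
        elif action == 2: c = max(c - 1, 0)
        elif action == 3: c = min(c + 1, self.size - 1)
        self.agent = r * self.size + c
        reward = 1.0 if self.agent in self.goals else -1.0 if self.agent in self.fires else 0.0
        return self.agent, reward, False  # episodes end via a step limit, not a done flag
```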

Implementing the network itself

As for reduce_mean, this page explains it clearly.
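For context, this is roughly where tf.reduce_mean shows up in the loss: it averages the squared TD error over the batch (a hedged sketch; variable names are mine).

```python
import tensorflow as tf

def dqn_loss(q_chosen, q_target):
    """q_chosen: Q(s,a) for the actions actually taken; q_target: the computed target values."""
    td_error = tf.square(q_target - q_chosen)  # per-example squared error
    return tf.reduce_mean(td_error)            # scalar loss = average over the batch
```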

Everything else should be fine, I think, since the comments are very clear…

Experience Replay

In updateTargetGraph, why does the index use idx + total_vars//2, that is, why the //2? (Presumably because tf.trainable_variables() lists the primary network's variables in the first half and the target network's in the second half, so idx + total_vars//2 maps each primary variable to its counterpart in the target network.)

And what determines the 5 in the experience_buffer class's sample, or the 21168 in processState? (My guess: each stored experience is a 5-tuple of state, action, reward, next state, done, and 21168 = 84 × 84 × 3, the size of a flattened game frame.)

Training the network

tf.train.Saver()

Saves and restores variables.
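Minimal usage of the TF 1.x Saver API (the checkpoint path and the dummy variable here are just examples):

```python
import tensorflow as tf

w = tf.Variable(tf.zeros([4]), name="w")  # any variable, so the Saver has something to track
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    path = saver.save(sess, "./dqn-model.ckpt")  # write a checkpoint to disk
    saver.restore(sess, path)                    # later, load the variables back
```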

So at the very start the actions are chosen completely at random (ε starts at 1)… and those initial steps are the pre_train_steps.
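A hedged sketch of that exploration schedule as I read it: purely random actions until pre_train_steps have passed, then ε-greedy with ε annealed toward a floor (the constants here are assumptions, not the tutorial's exact values).

```python
import random

def choose_action(q_values, total_steps, epsilon, pre_train_steps=10000):
    """q_values: the primary network's Q estimates for the current state."""
    if total_steps < pre_train_steps or random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit: argmax_a Q(s,a)
```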

trainBatch = myBuffer.sample(batch_size) #Get a random batch of experiences.

So a random batch of experiences is drawn from the buffer when updating.
