[Struggling to Understand Reinforcement Learning] Tutorial Part 4-6

Overview

In this installment of my Simple RL series, I want to introduce the concept of Partial Observability and demonstrate how to design neural agents which can successfully deal with it.

For us humans, having access to a limited and changing world is a universal aspect of our shared experience. Despite our partial access to the world, we are able to solve all sorts of challenging problems in the course of going about our daily lives.

The Problem of Partial Observability

Because the entire world is visible at any moment (and nothing moves aside from the agent), a single frame of this environment gives the agent all it needs to know in order to maximize its reward. Environments which follow a structure where a given state conveys everything the agent needs to act optimally are called Markov Decision Processes (MDPs).

Can you see what is behind you? This limited perspective on the visual world is almost always the default for humans and other animals. Even if we were to have 360 degree vision, we may still not know what is on the other side of a wall just beyond us. Information outside our view is often essential to making decisions regarding the world.

Environments which present themselves in a limited way to the agent are referred to as Partially Observable Markov Decision Processes (POMDPs). While they are trickier to solve than their fully observable counterparts, understanding them is essential to solving most realistic tasks.

When it's easier for the agent to find all the causal relationships than to solve the problem itself, does it solve it the same way it would in the real world?
Amazing…

Making sense of a limited, changing world

How can we build a neural agent which still functions well in a partially observable world? The key is to give the agent a capacity for temporal integration of observations.

The intuition behind this is simple: if information at a single moment isn’t enough to make a good decision, then enough varying information over time probably is.

So when one picture isn't enough, you predict from two pictures, apparently…

Within the context of Reinforcement Learning, there are a number of possible ways to accomplish this temporal integration. The solution taken by DeepMind in their original paper on Deep Q-Networks was to stack the frames from the Atari simulator.

Instead of feeding the network a single frame at a time, they used an external frame buffer which kept the last four frames of the game in memory and fed this to the neural network.

In the Atari games, it learned from four frames, not one!
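To make the frame-stacking idea concrete, here is a minimal sketch of such an external frame buffer. The class and its names are my own for illustration, not from the DeepMind implementation; the point is only that the agent's "state" becomes the last four frames stacked together.

```python
from collections import deque
import numpy as np

class FrameBuffer:
    """Keeps the last `size` frames and stacks them into one state.

    Illustrative sketch of DQN-style frame stacking; names and sizes
    are my own, not from the original implementation.
    """
    def __init__(self, size=4):
        self.size = size
        self.frames = deque(maxlen=size)  # oldest frame drops out automatically

    def reset(self, first_frame):
        # At episode start, fill the buffer by repeating the first frame.
        for _ in range(self.size):
            self.frames.append(first_frame)

    def add(self, frame):
        self.frames.append(frame)

    def state(self):
        # The network sees all four frames at once: shape (84, 84, 4).
        return np.stack(self.frames, axis=-1)

buf = FrameBuffer()
buf.reset(np.zeros((84, 84)))
buf.add(np.ones((84, 84)))
print(buf.state().shape)  # (84, 84, 4)
```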

This approach works, but it isn’t ideal for a number of reasons.

!!

The first is that it isn’t necessarily biologically plausible. When light hits our retinas, it does so at a single moment. There is no way for light to be stored up and passed all at once to an eye.

I feel like we don't need to be so particular about that, though.

Secondly, by using blocks of 4 frames as their state, the experience buffer used needed to be much larger to accommodate the larger stored states. This makes the training process require a larger amount of potentially unnecessary memory.

This one, fair enough…
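As rough back-of-the-envelope arithmetic (my own illustrative numbers, not from the article): if each stored transition holds a state and a next state, stacking four 84x84 frames per state makes each transition roughly four times larger than storing single frames.

```python
# Rough memory arithmetic with illustrative numbers: one 84x84
# grayscale uint8 frame vs. a stacked 4-frame state, where each
# transition stores both the state s and the next state s'.
frame_bytes = 84 * 84            # one grayscale uint8 frame
single = frame_bytes * 2         # (s, s') with single-frame states
stacked = frame_bytes * 4 * 2    # (s, s') with 4-frame stacked states
print(stacked // single)  # 4
```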

Lastly, we may simply need to keep things in mind that happened much earlier than would be feasible to capture with stacking frames. Sometimes an event hundreds of frames earlier might be essential to deciding what to do at the current moment. We need a way for our agent to keep events in mind more robustly.

I feel like this one can be justified by the results, though.

Recurrent Neural Networks

You may have heard of recurrent neural networks, and their capacity to learn temporal dependencies.

Uh, sorry, I'd never heard of them! (sweat)

This capacity has popularly been used for text generation, where groups have trained RNNs to reproduce everything from Barack Obama speeches to freeform poetry.

I may have seen this before.

Andrej Karpathy has a great post outlining RNNs and their capacities, which I highly recommend.

The great post is here.

By utilizing a recurrent block in our network, we can pass the agent single frames of the environment, and the network will be able to change its output depending on the temporal pattern of observations it receives. It does this by maintaining a hidden state that it computes at every time-step. The recurrent block can feed the hidden state back into itself, thus acting as an augmentation which tells the network what has come before.

Hmm, I didn't quite follow, but does it mean that with a single image, the output you get is fed back into the hidden layer again…
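The recurrence described above can be sketched with a toy tanh RNN cell (my own simplification; the tutorial itself uses an LSTM). The only point is the wiring: the hidden state computed at each step is fed back in at the next step, so the output can depend on the temporal pattern of single frames.

```python
import numpy as np

# Toy recurrent step: h_t = tanh(W_x x_t + W_h h_{t-1} + b).
# Sizes are illustrative, chosen by me for this sketch.
rng = np.random.default_rng(0)
obs_dim, hidden_dim = 8, 4
W_x = rng.normal(size=(hidden_dim, obs_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

def rnn_step(obs, h_prev):
    # The new hidden state depends on the current frame AND on
    # everything seen before, summarized in h_prev.
    return np.tanh(W_x @ obs + W_h @ h_prev + b)

h = np.zeros(hidden_dim)      # hidden state starts empty
for t in range(5):            # the agent receives one frame per step
    frame = rng.normal(size=obs_dim)
    h = rnn_step(frame, h)    # hidden state fed back into itself

print(h.shape)  # (4,)
```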

The class of agents which utilize this recurrent network are referred to as Deep Recurrent Q-Networks (DRQN).

It seems they implemented it just by looking at this, so I'd like to get to at least that level of speed… (big talk?)

Implementing in Tensorflow

In order to implement a Deep Recurrent Q-Network (DRQN) architecture in Tensorflow, we need to make a few modifications to our DQN described in Part 4.

That easily!?

The first change is to the agent itself. We will insert a LSTM recurrent cell between the output of the last convolutional layer and the input into the split between the Value and Advantage streams.
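The real change uses `tf.contrib.rnn.BasicLSTMCell` and `tf.nn.dynamic_rnn`; as a framework-free, shape-level sketch (sizes and the linear stand-in for the LSTM are my own illustrative choices), the flattened conv features are reshaped into [batch, time, features] for the recurrent layer, and its output is then split into the Value and Advantage streams:

```python
import numpy as np

# Shape-level sketch of the DRQN change with illustrative sizes:
# 4 traces of 8 steps each, 512 conv features, 256 recurrent units.
batch, trace_len, conv_feats, hidden = 4, 8, 512, 256

conv_out = np.random.randn(batch * trace_len, conv_feats)
# The recurrent layer wants [batch, time, features]:
rnn_in = conv_out.reshape(batch, trace_len, conv_feats)

# Stand-in for the LSTM: a per-step linear map + tanh, kept only
# to make the tensor shapes honest.
W = np.random.randn(conv_feats, hidden) * 0.01
rnn_out = np.tanh(rnn_in @ W)                        # [batch, time, hidden]
rnn_flat = rnn_out.reshape(batch * trace_len, hidden)

# Dueling split: half the recurrent units feed the Advantage stream,
# the other half feed the Value stream.
streamA, streamV = np.split(rnn_flat, 2, axis=1)
print(streamA.shape, streamV.shape)  # (32, 128) (32, 128)
```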

The second main change needed will be to adjust the way our experience buffer stores memories. Since we want to train our network to understand temporal dependencies, we can’t use random batches of experience.
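One way to sketch this change (class and method names are my own, not the tutorial's): store whole episodes and, at training time, draw a random episode and a random run of consecutive steps inside it, so each sampled trace preserves temporal order.

```python
import random

class EpisodeBuffer:
    """Stores whole episodes and samples contiguous traces from them.

    Illustrative sketch: instead of sampling individual random
    transitions, we pick a random episode and a random run of
    `trace_length` consecutive steps within it.
    """
    def __init__(self, capacity=1000):
        self.episodes = []
        self.capacity = capacity

    def add_episode(self, episode):
        # `episode` is a list of transitions, in order.
        if len(self.episodes) >= self.capacity:
            self.episodes.pop(0)
        self.episodes.append(episode)

    def sample(self, batch_size, trace_length):
        traces = []
        long_enough = [e for e in self.episodes if len(e) >= trace_length]
        for _ in range(batch_size):
            ep = random.choice(long_enough)
            start = random.randint(0, len(ep) - trace_length)
            traces.append(ep[start:start + trace_length])  # consecutive steps
        return traces

buf = EpisodeBuffer()
buf.add_episode(list(range(20)))   # toy "episode" of 20 numbered steps
traces = buf.sample(batch_size=3, trace_length=8)
```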

> Finally, we will be utilizing a technique developed by a group at Carnegie Mellon who recently used a DRQN to train a neural network to play the first person shooter game Doom. Instead of sending all the gradients backwards when training their agent, they sent only the last half of the gradients for a given trace.
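One common way to realize this (my own sketch, with illustrative sizes): build a mask that zeroes the loss for the first half of every trace, so only the second half, where the recurrent state has already been warmed up, contributes gradients.

```python
import numpy as np

# Illustrative sizes: 4 traces of 8 steps each, flattened into one batch.
batch_size, trace_length = 4, 8
td_error = np.random.randn(batch_size * trace_length)

# Zero out the first half of each trace, keep the last half.
mask = np.concatenate([np.zeros(trace_length // 2),
                       np.ones(trace_length // 2)])
mask = np.tile(mask, batch_size)       # one mask per trace in the batch

loss_terms = (td_error ** 2) * mask    # first half contributes nothing
loss = loss_terms.sum() / mask.sum()
print(mask[:8])  # [0. 0. 0. 0. 1. 1. 1. 1.]
```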

Code examples

Implementing the network itself

Looking at this, I finally realized:

state_in meant the initialization of the state…

slim.flatten apparently keeps the batch size and returns everything else flattened.

A very rough explanation of the difference: static_rnn leaves preprocessing of the input to the user and builds a fixed graph, whereas dynamic_rnn handles all the preprocessing internally and builds the graph based on that.

People who give these rough explanations are truly a blessing.

And this section seems to group things into units that can be told apart when they're used later.

What does that mean…
What is this about the first half…

Experience Replay

tf.nn.rnn_cell.BasicLSTMCell

Basic LSTM recurrent network cell.

For LSTMs, this looks like a good summary…

Training the network

tf.contrib.rnn.BasicLSTMCell

This one is easy to understand.
It covers LSTMs as well, so it's worth reading.

As for LSTMs, I'll read the blog recommended here… later.
I don't really understand memory cells and the like anyway…

Testing the network

The Double-DQN network isn't being updated, but I guess that's because this isn't training.
Hmm, it's not the test format we saw in Part 5, but I suppose this means running it with load_model set to True.