【Understanding Reinforcement Learning While Suffering】Tutorial, Part 4-8

This is the last installment of the series!

Part 8: Asynchronous Actor-Critic Agents (A3C)

Overview

In this article I want to provide a tutorial on implementing the Asynchronous Advantage Actor-Critic (A3C) algorithm in Tensorflow.

This is definitely the first I've heard of it.

So what is A3C? The A3C algorithm was released by Google’s DeepMind group earlier this year, and it made a splash by… essentially obsoleting DQN.

That "earlier this year" was two years ago now…

It was faster, simpler, more robust, and able to achieve much better scores on the standard battery of Deep RL tasks. On top of all that it could work in continuous as well as discrete action spaces. Given this, it has become the go-to Deep RL algorithm for new challenging problems with complex state and action spaces.

The 3 As of A3C

(Figure: diagram of the A3C architecture)

Asynchronous

Unlike DQN, where a single agent represented by a single neural network interacts with a single environment, A3C utilizes multiple incarnations of the above in order to learn more efficiently.

Ah, this is the thing that lets you do distributed, or rather parallel, processing.

The reason this works better than having a single agent (beyond the speedup of getting more work done), is that the experience of each agent is independent of the experience of the others.

Actor-Critic

In the case of A3C, our network will estimate both a value function V(s) (how good a certain state is to be in) and a policy π(s) (a set of action probability outputs).

I think this was also covered in the book…

Critically, the agent uses the value estimate (the critic) to update the policy (the actor) more intelligently than traditional policy gradient methods.

Advantage

The insight of using advantage estimates rather than just discounted returns is to allow the agent to determine not just how good its actions were, but how much better they turned out to be than expected. Intuitively, this allows the algorithm to focus on where the network’s predictions were lacking.

Advantage: A = Q(s,a) - V(s)

Since we won’t be determining the Q values directly in A3C, we can use the discounted returns (R) as an estimate of Q(s,a) to allow us to generate an estimate of the advantage.

Discounted Return: R = r_t + γr_{t+1} + γ²r_{t+2} + …
Advantage Estimate: A = R - V(s)
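
To make sure I follow this, here is a toy NumPy sketch of turning a short list of rewards into discounted returns and then into the advantage estimate A = R - V(s). This is my own example, not the tutorial's code, and the numbers are made up:

    import numpy as np

    def discounted_returns(rewards, gamma=0.99):
        # R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    # Toy rollout: per-step rewards and the critic's value estimates V(s).
    rewards = np.array([0.0, 0.0, 1.0])
    values  = np.array([0.2, 0.5, 0.9])

    R = discounted_returns(rewards)
    advantages = R - values              # Advantage Estimate: A = R - V(s)
    print(R, advantages)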

That said, this tutorial apparently uses a different method.

In this tutorial, we will go even further, and utilize a slightly different version of advantage estimation with lower variance referred to as Generalized Advantage Estimation.
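
A rough NumPy sketch of what Generalized Advantage Estimation looks like (my own reading of Schulman et al.'s formula, not the tutorial's code; the gamma and lambda values here are just assumptions):

    import numpy as np

    def gae(rewards, values, bootstrap_value=0.0, gamma=0.99, lam=0.95):
        # delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
        # A_t     = sum_k (gamma*lam)^k * delta_{t+k}
        values_ext = np.append(values, bootstrap_value)
        deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]
        advantages = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = deltas[t] + gamma * lam * running
            advantages[t] = running
        return advantages

    rewards = np.array([0.0, 0.0, 1.0])
    values  = np.array([0.2, 0.5, 0.9])
    print(gae(rewards, values))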

A3C algorithm

This is the part that's useful as a reference for the implementation.

Code example

  • AC_Network — the class for the network itself
  • Worker — holds a copy of AC_Network and an instance of the environment; it interacts with the environment and updates the global network
  • Workers can be run in parallel

The A3C algorithm begins by constructing the global network. This network will consist of convolutional layers to process spatial dependencies, followed by an LSTM layer to process temporal dependencies, and finally, value and policy output layers.
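
To picture that architecture, here is a minimal tf.keras sketch of an AC_Network-like model: convolutions over each frame (spatial), an LSTM over the frame sequence (temporal), then a policy head and a value head. This is not the tutorial's original TF1 code, and the layer sizes and input shape are assumptions of mine:

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_ac_network(height=84, width=84, channels=1, n_actions=4):
        # Input: a sequence of observation frames, shape (batch, time, H, W, C).
        obs = layers.Input(shape=(None, height, width, channels))
        # Convolutional layers process spatial dependencies within each frame.
        x = layers.TimeDistributed(layers.Conv2D(16, 8, strides=4, activation="elu"))(obs)
        x = layers.TimeDistributed(layers.Conv2D(32, 4, strides=2, activation="elu"))(x)
        x = layers.TimeDistributed(layers.Flatten())(x)
        # The LSTM layer processes temporal dependencies across frames.
        x = layers.LSTM(256)(x)
        # Two output heads: policy (action probabilities) and value.
        policy = layers.Dense(n_actions, activation="softmax", name="policy")(x)
        value = layers.Dense(1, name="value")(x)
        return tf.keras.Model(inputs=obs, outputs=[policy, value])

    model = build_ac_network()
    model.summary()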

I don't quite get the difference between "process spatial dependencies" and "process temporal dependencies" here.

Next, a set of worker agents, each with its own network and environment, is created. Each of these workers is run on a separate processor thread, so there should be no more workers than there are threads on your CPU.

Huh, so this is just an excerpt… there are no class definitions here.
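
Since the class definitions aren't shown in the excerpt, here is a toy stand-in for the worker-spawning part: a stub Worker class (hypothetical, not the tutorial's) started on one thread per CPU core:

    import multiprocessing
    import threading

    class Worker:
        # Stub standing in for the tutorial's Worker class (hypothetical).
        def __init__(self, worker_id):
            self.worker_id = worker_id

        def work(self):
            # In the real implementation this loop would interact with the
            # environment and push updates to the global network.
            print("worker", self.worker_id, "running")

    # One worker per CPU thread, as the article recommends.
    num_workers = multiprocessing.cpu_count()
    workers = [Worker(i) for i in range(num_workers)]

    threads = [threading.Thread(target=w.work) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()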

~ From here we go asynchronous ~
Each worker begins by setting its network parameters to those of the global network. We can do this by constructing a Tensorflow op which sets each variable in the local worker network to the equivalent variable value in the global network.
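
The original code builds explicit TensorFlow assign ops for this sync step. As a rough sketch of the same idea in modern Keras terms (my own toy networks, not the tutorial's), copying the global weights into the local model does the job:

    import tensorflow as tf

    def make_net():
        # Tiny stand-in network; the real one is the conv+LSTM model above.
        return tf.keras.Sequential([
            tf.keras.Input(shape=(4,)),
            tf.keras.layers.Dense(16, activation="relu"),
            tf.keras.layers.Dense(2, activation="softmax"),
        ])

    global_net = make_net()
    local_net = make_net()

    # Each worker starts by setting its parameters to the global network's.
    local_net.set_weights(global_net.get_weights())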

Each worker then interacts with its own copy of the environment and collects experience. Each keeps a list of experience tuples (observation, action, reward, done, value) that is constantly added to from interactions with the environment.
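
So the experience history is literally just a list of tuples. A trivial sketch, with random stand-ins instead of a real environment or policy:

    import random

    # (observation, action, reward, done, value) tuples, appended each step.
    episode_buffer = []
    for step in range(5):
        observation = [random.random()]    # stand-in for an environment observation
        action = random.randrange(2)       # stand-in for sampling from pi(s)
        reward = random.random()
        done = (step == 4)
        value = random.random()            # stand-in for the critic's V(s)
        episode_buffer.append((observation, action, reward, done, value))

    print(len(episode_buffer), episode_buffer[0])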

Once the worker’s experience history is large enough, we use it to determine discounted return and advantage, and use those to calculate value and policy losses.
We also calculate an entropy (H) of the policy. This corresponds to the spread of action probabilities.

Value Loss: L = Σ(R - V(s))²
Policy Loss: L = -log(π(s)) * A(s) - β*H(π)
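
Writing those three terms out in NumPy for one small batch helped me see what's going on (my own toy numbers; beta and the shapes are assumptions):

    import numpy as np

    returns    = np.array([1.0, 0.9, 1.2])     # discounted returns R
    values     = np.array([0.8, 1.0, 0.9])     # critic outputs V(s)
    advantages = returns - values               # or the GAE estimate above
    probs      = np.array([[0.7, 0.3],          # policy outputs pi(s)
                           [0.6, 0.4],
                           [0.5, 0.5]])
    actions    = np.array([0, 1, 0])            # actions actually taken
    beta       = 0.01                           # entropy coefficient

    # Value Loss: L = Sum (R - V(s))^2
    value_loss = np.sum((returns - values) ** 2)

    # Entropy H(pi): how spread out the action probabilities are.
    entropy = -np.sum(probs * np.log(probs))

    # Policy Loss: L = -log(pi(a|s)) * A(s) - beta * H(pi)
    taken_prob = probs[np.arange(len(actions)), actions]
    policy_loss = -np.sum(np.log(taken_prob) * advantages) - beta * entropy

    print(value_loss, policy_loss, entropy)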

Well, I sort of get the idea, I guess.
And whenever it feels like that, my understanding is usually shaky…

The full version, with nothing omitted

Also, this question was really good.
I need to be reading at that level.

