【Understanding Reinforcement Learning While Suffering】 Tutorial Part 3

Continued from Part 2.

Quick Read

Reinforcement Learning from scratch

This post shares the contents of "a tutorial on Deep Reinforcement Learning (Deep RL) from scratch" by Unity Technologies.

The rise of Deep Reinforcement Learning

Roughly speaking, Deep Reinforcement Learning is blowing up.

The top highlight (on Medium) is as follows.

In dialog systems for example, classical Deep Learning aims to learn the right response for a given query. On the other hand, Deep Reinforcement Learning focuses on the right sequences of sentences that will lead to a positive outcome, for example a happy customer.

That's the core principle.

The best way to do this rapidly is by using a simulation environment. This tutorial will be using Unity to create environments to train agents in.

I think that's fair, but Unity's fingerprints are all over it.

That said, this article is really good!

Everything I've studied purely as theory so far is laid out step by step from an implementation point of view!
The actual implementation isn't included, but the "first do this, then do that" flow makes it easy to picture!

We will use a method called Temporal Difference (TD) learning to learn a good Q function. The idea is to only look at a limited number of steps in the future. TD(1) for example, only uses the next 2 states to evaluate the reward.
Surprisingly, we can use TD(0), which looks at the current state, and our estimate of the reward the next turn, and get great results. The structure of the network is the same, but we need to go through one forward step before receiving the error. We then use this error to back propagate gradients, like in traditional Deep Learning, and update our value estimates.

Wasn't TD(1) supposed to use only the terminal value (i.e., the full return to the end of the episode, like Monte Carlo)?
And what are "the next 2 states" supposed to be...
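
To make the TD(0) part concrete for myself, here is a minimal sketch of a tabular TD(0)-style Q update in Python (the state/action counts and hyperparameters are my own toy assumptions, not from the tutorial): the target uses only the immediate reward plus our current estimate of the next state, and the TD error is what gets back-propagated in the deep version.

import numpy as np

# Minimal sketch: tabular TD(0)-style Q update (toy sizes, assumed).
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # learning rate and discount factor

def td0_update(s, a, r, s_next, done):
    # Look only one step ahead: immediate reward + current estimate of the next state.
    target = r if done else r + gamma * Q[s_next].max()
    td_error = target - Q[s, a]  # in the deep version this error drives the gradient step
    Q[s, a] += alpha * td_error
    return td_error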

Another method to estimate the eventual success of our actions is Monte Carlo Estimates.

Hold on, is it only Monte Carlo that uses the final outcome, and TD(λ) has nothing to do with the terminal value?
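
For contrast, a Monte Carlo estimate waits until the episode is finished and credits every step with the full discounted return. A minimal sketch (the function name and discount value are my own):

def monte_carlo_returns(rewards, gamma=0.99):
    # Full discounted return G_t for every step of one finished episode,
    # computed by sweeping backwards from the final reward.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# e.g. monte_carlo_returns([0, 0, 1]) -> [0.9801, 0.99, 1.0]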

The approach we will use here is called Policy Gradients, and is an on-policy method. Previously, we were first learning a value function Q for each action in each state and then building a policy on top. In Vanilla Policy Gradient, we still use Monte Carlo Estimates, but we learn our policy directly through a loss function that increases the probability of choosing rewarding actions.
The way that we encourage exploration is by using a method called entropy regularization, which pushes our probability estimates to be wider, and thus will encourage us to make riskier choices to explore the space.

Policy Gradients, Vanilla Policy Gradient, entropy regularization: all of these are new to me.
And on top of that, things like Advantage Actor Critic (A2C) as well.
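
Since these were all new to me, here is a minimal sketch of what a Vanilla Policy Gradient update with entropy regularization could look like (PyTorch; the network sizes and the entropy coefficient beta are placeholders I made up): the loss raises the log-probability of actions weighted by their Monte Carlo returns, and the entropy bonus keeps the action distribution wide, which encourages exploration.

import torch

# Tiny softmax policy over 2 discrete actions from a 4-dimensional state (toy sizes, assumed).
policy = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def vpg_update(states, actions, returns, beta=0.01):
    # states: (T, 4) float tensor, actions: (T,) long tensor, returns: (T,) Monte Carlo returns
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)               # log pi(a_t | s_t)
    pg_loss = -(log_probs * returns).mean()          # push up the probability of rewarding actions
    loss = pg_loss - beta * dist.entropy().mean()    # entropy regularization: keep the distribution wide
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()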

Deep Reinforcement Learning and Generative Adversarial Networks

On Generative Adversarial Networks (the much-talked-about GAN) and Deep Reinforcement Learning.
A little happy to see GANs show up.

Generative Adversarial Networks, which churn out stunning graphics — e.g., drawings, paintings, photographs —
Deep Reinforcement Learning models, which have exceeded human performance on complex problems, including Atari video games, the popular Asian board game Go, and safely driving an automobile.

Generative Adversarial Networks

GAN involve two separate deep neural networks acting against each other as adversaries.

The first network — the Generator — attempts to create forgeries of some category of human-created images, say all the works of Claude Monet or Pablo Picasso.

So the Generator can apparently generate images with some kind of algorithm, like turning photos into Monet-style or Picasso-style paintings.

The second network — the Discriminator — does its best to distinguish the forgeries from the real images.

And the Discriminator is used to tell the forgeries apart from the real images.
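
Putting those two quoted sentences together, a minimal adversarial training loop might look like this (PyTorch, with 1-D toy data standing in for Monet or Picasso images; all sizes and hyperparameters are placeholders of mine): the Discriminator learns to separate real samples from forgeries, and the Generator learns to make the Discriminator label its forgeries as real.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))  # Generator: noise -> fake sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))  # Discriminator: sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0   # "real" data: a toy 1-D Gaussian
    fake = G(torch.randn(64, 8))            # forgeries generated from random noise

    # Discriminator step: distinguish the forgeries from the real samples.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the Discriminator into calling forgeries real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()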

Something called CycleGAN (paper, code) also shows up; it can do things like turning an entire video of horses into zebras.

There's also a crazy one on the creator's project page that swaps ramen with people's faces...

Face ramen

Come to think of it, GANs aren't reinforcement learning though... lol

Deep Reinforcement Learning

An introduction to the famous Atari paper.

The rest is an ad for the paid tutorial lol
