【Understanding Reinforcement Learning the Hard Way】 Tutorial Part 4-1

This time: Part 1 — Two-Armed Bandit

Overview

Two-Armed Bandit

The simplest reinforcement learning problem is the n-armed bandit. Essentially, there are n-many slot machines, each with a different fixed payout probability. The goal is to discover the machine with the best payout, and maximize the returned reward by always choosing it. We are going to make it even simpler, by only having two possible slot machines to choose between.

  1. Different actions yield different rewards. For example, when looking for treasure in a maze, going left may lead to the treasure, whereas going right may lead to a pit of snakes.
  2. Rewards are delayed over time. This just means that even if going left in the above example is the right thing to do, we may not know it until later in the maze.
  3. Reward for an action is conditional on the state of the environment. Continuing the maze example, going left may be ideal at a certain fork in the path, but not at others.

The n-armed bandit is a nice starting place because we don’t have to worry about aspects #2 and 3.

Why is it that the n-armed bandit lets us stop worrying about #2 and #3…? Presumably because the bandit has only a single state and every pull pays out its reward immediately, so delayed rewards and state-dependent rewards never come into play.
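To make the setup concrete, here is a minimal sketch of a two-armed bandit environment. The payout probabilities and the helper name pull_arm are made-up illustrations, not values from the original tutorial:

```python
import numpy as np

# Hypothetical payout probabilities for the two arms (illustrative values only).
bandit_probs = [0.4, 0.6]

def pull_arm(arm):
    """Pull one arm: return +1 with that arm's payout probability, otherwise -1."""
    return 1 if np.random.rand() < bandit_probs[arm] else -1
```

Every pull pays off immediately and the machine has no internal state, which is exactly why aspects #2 and #3 drop out.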

We start from the n-armed bandit here, and there are two ways to approach it.

1. Policy gradients: our simple neural network learns a policy for picking actions by adjusting its weights through gradient descent, using feedback from the environment.
2. Value functions: instead of learning the optimal action in a given state, the agent learns to predict how good a given state or action will be for the agent to be in.

Policy Gradient

To update our network, we will simply try an arm with an e-greedy policy.

This has pretty much become the mainstream approach by now…
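As a quick reminder of what an e-greedy policy does, here is a small sketch; the default value of epsilon is an arbitrary assumption:

```python
import numpy as np

def epsilon_greedy(weights, epsilon=0.1):
    """With probability epsilon pick a random arm, otherwise pick the greedy arm."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(weights))  # explore
    return int(np.argmax(weights))              # exploit
```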

Policy loss equation:

Loss = -log(π) * A

A is advantage, and is an essential aspect of all reinforcement learning algorithms. Intuitively it corresponds to how much better an action was than some baseline.

Here the baseline is 0.

π is the policy. In this case, it corresponds to the chosen action’s weight.

Intuitively, this loss function allows us to increase the weight for actions that yielded a positive reward, and decrease them for actions that yielded a negative reward.
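A quick numerical sanity check of that claim; the numbers are made up, and π here stands for the chosen action's weight:

```python
import numpy as np

pi = 0.8                       # weight of the chosen action (hypothetical value)
for A in (+1.0, -1.0):         # positive vs. negative advantage
    loss = -np.log(pi) * A
    grad = -A / pi             # d(loss)/d(pi)
    print(f"A={A:+.0f}: loss={loss:+.3f}, dLoss/dpi={grad:+.3f}")
# A = +1 gives a negative gradient, so gradient descent increases the weight;
# A = -1 gives a positive gradient, so the weight is decreased.
```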

Code Example

The Agent

I don't use TensorFlow all that often, so this part is a bit rough for me…
Or rather, I just keep forgetting it…

For tf.Variable, this page was easy to follow.

As the name suggests, a placeholder just reserves a spot in the graph for data to be fed in later.

tf.slice looks like it could come in handy quite a bit.

For the loss, it uses the equation from the Policy Gradient section above.
So the A for advantage mentioned there turns out to be the reward.
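Putting together the pieces mentioned above (tf.Variable, placeholder, tf.slice, and the policy-gradient loss), the agent graph looks roughly like this in TensorFlow 1.x style. This is a sketch, not the tutorial's exact code; num_bandits and the learning rate are assumptions:

```python
import tensorflow as tf

num_bandits = 2  # two-armed bandit (assumed)

tf.reset_default_graph()

# One trainable weight per arm; the greedy policy picks the arm with the largest weight.
weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights, 0)

# Placeholders reserve spots for the reward received and the action actually taken.
reward_holder = tf.placeholder(shape=[1], dtype=tf.float32)
action_holder = tf.placeholder(shape=[1], dtype=tf.int32)

# tf.slice picks out the chosen action's weight -- the "π" in the loss.
responsible_weight = tf.slice(weights, action_holder, [1])

# Loss = -log(π) * A, with the raw reward used as the advantage A (baseline 0).
loss = -(tf.log(responsible_weight) * reward_holder)
update = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
```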

Training the Agent

np.random.rand(1) generates a random number between 0 and 1 (more precisely, in the half-open interval [0, 1)).

As for sess, this page explains it clearly!

Everything else seems fine!
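Tying it all together, a minimal training loop might look like the sketch below. It reuses the graph from the Agent section and the pull_arm / bandit_probs sketch above; the number of episodes and epsilon are assumed values:

```python
import numpy as np
import tensorflow as tf

total_episodes = 1000
epsilon = 0.1  # exploration rate (assumed)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(total_episodes):
        # e-greedy choice: explore with probability epsilon, otherwise act greedily.
        if np.random.rand(1) < epsilon:
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(chosen_action)
        reward = pull_arm(action)
        # One gradient step on -log(π) * A for the chosen arm.
        sess.run(update, feed_dict={reward_holder: [reward],
                                    action_holder: [action]})
    print(sess.run(weights))  # the better-paying arm should end up with the larger weight
```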
