# [Understanding Reinforcement Learning Through Suffering] Tutorial Part 4-1

## Two-Armed Bandit

The simplest reinforcement learning problem is the n-armed bandit. Essentially, there are n-many slot machines, each with a different fixed payout probability. The goal is to discover the machine with the best payout, and maximize the returned reward by always choosing it. We are going to make it even simpler, by only having two possible slot machines to choose between.
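The two-armed setup can be simulated in a few lines of NumPy. This is just a sketch; the payout probabilities and function name below are made-up values for illustration, not taken from the tutorial:

```python
import numpy as np

# Hypothetical payout probabilities for the two arms (illustrative values).
PAYOUT_PROBS = [0.4, 0.7]  # arm 0 pays out 40% of the time, arm 1 pays 70%

def pull_arm(arm, rng):
    """Pull one arm: reward +1 on a payout, -1 otherwise."""
    return 1 if rng.random() < PAYOUT_PROBS[arm] else -1

rng = np.random.default_rng(0)
# Total reward from pulling each arm 1000 times.
rewards = [sum(pull_arm(arm, rng) for _ in range(1000)) for arm in (0, 1)]
print(rewards)  # arm 1 should accumulate more reward on average
```

The goal of the agent is exactly to discover that arm 1 is the better machine from these noisy pulls alone.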

1. Different actions yield different rewards. For example, when looking for treasure in a maze, going left may lead to the treasure, whereas going right may lead to a pit of snakes.
2. Rewards are delayed over time. This just means that even if going left in the above example is the right thing to do, we may not know it until later in the maze.
3. Reward for an action is conditional on the state of the environment. Continuing the maze example, going left may be ideal at a certain fork in the path, but not at others.

The n-armed bandit is a nice starting place because we don’t have to worry about aspects #2 and 3.

Why don't we need to worry about #2 and #3 once it's an n-armed bandit...? (Presumably because each pull pays out immediately, and the bandit has only a single state, so the reward never depends on the environment.)

We start from the n-armed bandit here, and there are two possible approaches.

To update our network, we will simply try an arm with an e-greedy policy
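A minimal sketch of the e-greedy (ε-greedy) choice mentioned here; the weight values, epsilon, and function name are illustrative assumptions, not the tutorial's exact code:

```python
import numpy as np

def choose_action(weights, epsilon, rng):
    """ε-greedy: with probability epsilon pick a random arm (explore),
    otherwise pick the arm with the highest current weight (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(weights)))  # explore
    return int(np.argmax(weights))              # exploit

rng = np.random.default_rng(1)
weights = np.array([0.2, 0.9])  # assumed current action weights
actions = [choose_action(weights, epsilon=0.1, rng=rng) for _ in range(100)]
# With epsilon = 0.1, the vast majority of choices should be the greedy arm (index 1).
```

The point of the random ε fraction is that the agent keeps occasionally sampling the other arm, so a wrong initial estimate can still be corrected.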

This approach has already become mainstream...

A is advantage, and is an essential aspect of all reinforcement learning algorithms. Intuitively it corresponds to how much better an action was than some baseline.

Here the baseline is 0.

π is the policy. In this case, it corresponds to the chosen action’s weight.

Intuitively, this loss function allows us to increase the weight for actions that yielded a positive reward, and decrease them for actions that yielded a negative reward.
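That intuition can be checked by hand, assuming a loss of the form -log(π)·A as described above (the function names here are mine, not the tutorial's):

```python
import numpy as np

def policy_loss(chosen_weight, advantage):
    """Policy-gradient loss: -log(pi) * A, where pi is the chosen
    action's weight and A is the advantage (reward, with baseline 0)."""
    return -np.log(chosen_weight) * advantage

# Gradient of the loss with respect to the chosen weight: d/dw [-log(w) * A] = -A / w
grad = lambda w, a: -a / w

w = 0.5
print(grad(w, +1))  # negative gradient -> gradient descent increases the weight
print(grad(w, -1))  # positive gradient -> gradient descent decreases the weight
```

So minimizing this loss pushes the weight up after a positive reward and down after a negative one, exactly as the paragraph above says.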

## Code Example

### The Agent

I don't touch TensorFlow very often, so this is rough going...
Or rather, I keep forgetting it...

For tf.Variable, this explanation was easy to follow.

placeholder is, as the name suggests, just a way to reserve a slot for data to be fed in later.

slice looks pretty useful.

For the loss, the formula from the Policy Gradient section is used.
np.random.rand(1) generates a uniform random number in [0, 1).
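A quick check of that behavior (note that np.random.rand(1) returns a length-1 array, not a scalar, and the upper bound 1 is exclusive); the epsilon value here is an assumed example:

```python
import numpy as np

e = 0.1  # exploration rate, value assumed for illustration
sample = np.random.rand(1)  # one uniform draw from [0, 1), as a length-1 array
explore = sample < e        # True roughly 10% of the time
print(sample.shape)         # (1,)
```

This is the comparison that drives the ε-greedy branch: when the draw falls below e, the agent picks a random arm instead of the greedy one.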