# [Struggling to Understand Reinforcement Learning] Tutorial Part 4-2

## Overview

https://github.com/awjuliani/DeepRL-Agents/blob/master/Vanilla-Policy.ipynb

I am going to describe how we get from that simple agent to one that is capable of taking in an observation of the world, and taking actions which provide the optimal reward not just in the present, but over the long run. With these additions, we will have a full reinforcement learning agent.

Environments which pose the full problem to an agent are referred to as Markov Decision Processes (MDPs).

A note on notation:

To be a little more formal, we can define a Markov Decision Process as follows. An MDP consists of a set of all possible states S, from which our agent at any time experiences some state s, and a set of all possible actions A, from which our agent at any time takes some action a. Given a state-action pair (s, a), the transition probability to a new state s' is defined by T(s, a), and the reward r is given by R(s, a). As such, at any time in an MDP, an agent is given a state s, takes action a, and receives a new state s' and reward r.
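As a concrete toy instance of the (S, A, T, R) tuple above, here is a minimal sketch in plain Python. The state names, actions, probabilities, and rewards are all made up for illustration; they are not from the notebook:

```python
import random

# T maps (s, a) -> list of (next_state, probability); R maps (s, a) -> reward.
T = {
    ("s0", "left"):  [("s0", 0.9), ("s1", 0.1)],
    ("s0", "right"): [("s1", 1.0)],
    ("s1", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("s1", 1.0)],
}
R = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 0.0, ("s1", "right"): 0.5,
}

def step(s, a):
    """Sample s' from T(s, a) and return (s', r), like one MDP transition."""
    u, acc = random.random(), 0.0
    for s_next, p in T[(s, a)]:
        acc += p
        if u <= acc:
            return s_next, R[(s, a)]
    return T[(s, a)][-1][0], R[(s, a)]

s_next, r = step("s0", "right")  # deterministic transition to "s1", reward 1.0
```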

Here we use OpenAI's gym again!
This time it's the one called Cart-Pole!

Unlike the two-armed bandit, this task requires the following two things.

1. Observations — The agent needs to know where the pole currently is, and the angle at which it is balancing. To accomplish this, our neural network will take an observation and use it when producing the probability of an action.
2. Delayed reward — Keeping the pole in the air as long as possible means moving in ways that will be advantageous for both the present and the future. To accomplish this we will adjust the reward value for each observation-action pair using a function that weighs actions over time.

To accomplish this, we will collect experiences in a buffer, and then occasionally use them to update the agent all at once. These sequences of experience are sometimes referred to as rollouts, or experience traces. We can't just apply these rollouts by themselves, however; we will need to ensure that the rewards are properly adjusted by a discount factor.
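The discount adjustment mentioned above can be sketched like this: walk backwards through a rollout's rewards, accumulating a gamma-discounted running sum. This is the standard `discount_rewards` pattern in NumPy; the function name and details here follow common convention rather than quoting the notebook verbatim:

```python
import numpy as np

def discount_rewards(r, gamma=0.99):
    """Walk the rollout backwards; each step's return is its reward
    plus gamma times the return of the following step."""
    discounted = np.zeros_like(r, dtype=np.float64)
    running_add = 0.0
    for t in reversed(range(len(r))):
        running_add = running_add * gamma + r[t]
        discounted[t] = running_add
    return discounted

# A 3-step rollout of reward 1 per step: earlier steps get credit
# for the rewards that follow them.
returns = discount_rewards(np.array([1.0, 1.0, 1.0]))
# returns -> [2.9701, 1.99, 1.0]
```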

## Code Examples

### The Policy-Based Agent

It seems the loss function has been updated.

This is the part in question.

For `indexes`, `action_holder` is added to offsets derived from the shape of the output matrix (the combinations of all states and all actions), but why add it...?
The `gather` in `responsible_outputs` does exactly what its name says: it gathers things.

> Gather slices from params axis `axis` according to indices
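A NumPy re-enactment makes the index arithmetic concrete. The notebook builds `indexes` with `tf.range`/`tf.shape` and looks them up with `tf.gather` on a flattened output, but the arithmetic is the same: once the (batch, num_actions) matrix is flattened row-major, the probability of the action actually taken in example i sits at flat position `i * num_actions + action`, which is why `action_holder` is added. The numbers below are illustrative:

```python
import numpy as np

output = np.array([[0.7, 0.3],
                   [0.2, 0.8],
                   [0.5, 0.5]])   # per-state action probabilities (batch of 3)
actions = np.array([0, 1, 0])     # action taken in each state

num_actions = output.shape[1]
# Row offset (batch index * row width) plus the column for the chosen action:
indexes = np.arange(output.shape[0]) * num_actions + actions  # -> [0, 3, 4]

# Flatten and gather: the probability the network assigned to each taken action.
responsible_outputs = output.reshape(-1)[indexes]             # -> [0.7, 0.8, 0.5]
```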

And then there is `reduce_mean`, which produces the loss:

[blogcard url="https://qiita.com/maguro27/items/2effbbafc2c8e7a7eb64#tfreduce_mean"]
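In spirit, the loss is `-reduce_mean(log(responsible_outputs) * rewards)`: the log-probability of each taken action, weighted by its (discounted) reward, averaged over the batch and negated so that gradient descent pushes probability toward well-rewarded actions. Here `np.mean` stands in for `tf.reduce_mean`, and the numbers are illustrative:

```python
import numpy as np

responsible_outputs = np.array([0.7, 0.8, 0.5])   # pi(a_t | s_t) for taken actions
rewards = np.array([2.9701, 1.99, 1.0])           # discounted returns

# Negative mean of reward-weighted log-probabilities:
loss = -np.mean(np.log(responsible_outputs) * rewards)
```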

And here `tf.trainable_variables()` has me feeling fuzzy again.
Looking it up once more: it returns the variables with `trainable=True`! So nothing new is being created; the variables that already exist and are marked `trainable=True` are what end up in `tvars`.

This explanation of `tf.gradients` is easy to follow!
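What `tf.gradients(loss, tvars)` hands back is one partial-derivative tensor per variable in `tvars`. As a loose sanity check on what a gradient means, a central finite difference approximates the same quantity for a toy scalar loss; everything here is illustrative and not from the notebook:

```python
def loss_fn(theta):
    """Toy quadratic loss with minimum at theta = 3; gradient is 2*(theta - 3)."""
    return (theta - 3.0) ** 2

def finite_diff(f, x, eps=1e-5):
    """Central finite-difference approximation of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

grad = finite_diff(loss_fn, 1.0)  # analytic value: 2 * (1 - 3) = -4
```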