【Struggling to Understand Reinforcement Learning】Tutorial Part 4-2

This time: Part 2 - Policy-based Agents

Overview

https://github.com/awjuliani/DeepRL-Agents/blob/master/Vanilla-Policy.ipynb

I am going to describe how we get from that simple agent to one that is capable of taking in an observation of the world, and taking actions which provide the optimal reward not just in the present, but over the long run. With these additions, we will have a full reinforcement agent.

Environments which pose the full problem to an agent are referred to as Markov Decision Processes (MDPs).

Notes on notation

To be a little more formal, we can define a Markov Decision Process as follows. An MDP consists of a set of all possible states S from which our agent at any time will experience s. A set of all possible actions A from which our agent at any time will take action a. Given a state action pair (s, a), the transition probability to a new state s’ is defined by T(s, a), and the reward r is given by R(s, a). As such, at any time in an MDP, an agent is given a state s, takes action a, and receives new state s’ and reward r.
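Just to pin this down for myself, here is a minimal sketch of that loop with gym's CartPole standing in as the environment (my own illustration, assuming the classic gym API where step returns the next state, reward, done flag and info):

    import gym

    # One episode fragment of the MDP loop: the agent is given state s, takes
    # action a, and receives the new state s' and reward r from the environment.
    env = gym.make('CartPole-v0')
    s = env.reset()
    for _ in range(10):
        a = env.action_space.sample()       # random stand-in for the agent's policy
        s_next, r, done, _ = env.step(a)    # (s, a) -> (s', r)
        s = s_next if not done else env.reset()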

Cart-Pole Task

Here we use OpenAI's gym again!
This time it's the Cart-Pole task!

Unlike the two-armed bandit, this task requires the following two things.

  1. Observations — The agent needs to know where the pole currently is, and the angle at which it is balancing. To accomplish this, our neural network will take an observation and use it when producing the probability of an action.
  2. Delayed reward — Keeping the pole in the air as long as possible means moving in ways that will be advantageous for both the present and the future. To accomplish this we will adjust the reward value for each observation-action pair using a function that weighs actions over time.

To accomplish this, we will collect experiences in a buffer, and then occasionally use them to update the agent all at once. These sequences of experience are sometimes referred to as rollouts, or experience traces. We can't just apply these rollouts by themselves, however; we will need to ensure that the rewards are properly adjusted by a discount factor.

So does this mean we occasionally need to work out the future value? ...

Code example
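The discounting described above would look something like this (a sketch in the spirit of the notebook's discount_rewards helper; the gamma value is my assumption):

    import numpy as np

    # Walk backwards through the episode so that each reward is credited with a
    # gamma-decayed share of every reward that came after it.
    def discount_rewards(r, gamma=0.99):
        discounted_r = np.zeros_like(r, dtype=np.float64)
        running_add = 0.0
        for t in reversed(range(len(r))):
            running_add = running_add * gamma + r[t]
            discounted_r[t] = running_add
        return discounted_r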

The Policy-Based Agent

It looks like the loss function has been updated:

        self.indexes = tf.range(0, tf.shape(self.output)[0]) * tf.shape(self.output)[1] + self.action_holder
        self.responsible_outputs = tf.gather(tf.reshape(self.output, [-1]), self.indexes)
        self.loss = -tf.reduce_mean(tf.log(self.responsible_outputs)*self.reward_holder)     

This is the part.

indexes multiplies each row index (tf.shape(self.output)[0] is the batch size) by the number of actions (tf.shape(self.output)[1]) and then adds action_holder. Why the addition? Because once output is flattened into a single vector, the entry at row i, column a lands at position i * num_actions + a, so adding the chosen action points straight at the probability of the action that was actually taken.
The gather in responsible_outputs then does what the name says: it gathers up exactly those entries.

Gather slices from params axis `axis` according to `indices`.
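A toy numpy check of the indexing trick (made-up numbers, not from the notebook):

    import numpy as np

    output = np.array([[0.7, 0.3],    # 3 states in the batch, 2 actions each
                       [0.4, 0.6],
                       [0.9, 0.1]])
    actions = np.array([1, 0, 1])     # action actually taken in each state

    # row_index * num_actions + action == position in the flattened output
    indexes = np.arange(output.shape[0]) * output.shape[1] + actions
    responsible = output.reshape(-1)[indexes]
    print(responsible)                # [0.3 0.4 0.1] -- probability of each chosen action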

And then there is reduce_mean, which produces the loss:

A function that returns the mean of the numbers in the list you give it.

Reference: https://qiita.com/maguro27/items/2effbbafc2c8e7a7eb64#tfreduce_mean
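Putting it together, the loss is just the negative mean of the reward-weighted log-probabilities; the minus sign turns "maximize" into something the optimizer can minimize. Another toy check with made-up numbers:

    import numpy as np

    responsible = np.array([0.3, 0.4, 0.1])   # chosen-action probabilities (made up)
    rewards = np.array([1.5, 1.2, 0.8])       # discounted rewards (made up)
    loss = -np.mean(np.log(responsible) * rewards)
    print(loss)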

Gradient handling has also been added!

        tvars = tf.trainable_variables()
        self.gradient_holders = []
        for idx,var in enumerate(tvars):
            placeholder = tf.placeholder(tf.float32,name=str(idx)+'_holder')
            self.gradient_holders.append(placeholder)

        self.gradients = tf.gradients(self.loss,tvars)

And here I got stuck on tf.trainable_variables() once again.
Looking it up one more time: it returns the variables that were created with trainable=True. So nothing new is being created here; the already-existing variables with trainable=True are what end up in tvars!

As for tf.gradients, this reference explains it clearly!
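For context, here is a hedged sketch of how these pieces connect inside the agent: the gradients of the loss are not applied immediately, but are fed back in later through the placeholders (the optimizer choice here is my assumption):

        # Applies externally accumulated gradients (fed in through gradient_holders)
        # to the trainable variables in one step.
        optimizer = tf.train.AdamOptimizer(learning_rate=1e-2)
        self.update_batch = optimizer.apply_gradients(zip(self.gradient_holders, tvars))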

Training the Agent

    gradBuffer = sess.run(tf.trainable_variables())
    for ix,grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0

This is the initialization of the gradients (zeroing out gradBuffer), right!?
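For reference, a sketch of how gradBuffer gets used later in the loop, based on my reading of the notebook (a fragment; treat names like grads, update_frequency and update_batch as assumptions):

    # After each episode, that episode's gradients (grads, computed by running
    # myAgent.gradients on the episode history) are added into the buffer ...
    for idx, grad in enumerate(grads):
        gradBuffer[idx] += grad

    # ... and every update_frequency episodes the accumulated gradients are fed
    # back through the placeholders, applied, and the buffer is zeroed again.
    if i % update_frequency == 0 and i != 0:
        feed_dict = dict(zip(myAgent.gradient_holders, gradBuffer))
        _ = sess.run(myAgent.update_batch, feed_dict=feed_dict)
        for ix, grad in enumerate(gradBuffer):
            gradBuffer[ix] = grad * 0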

            a_dist = sess.run(myAgent.output,feed_dict={myAgent.state_in:[s]})
            a = np.random.choice(a_dist[0],p=a_dist[0])
            a = np.argmax(a_dist == a)

Here, given state s, a_dist is the network's output probabilities over the available actions; action a is sampled according to those probabilities (not uniformly), and then the index of the chosen action is recovered.
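A toy numpy version of that trick (made-up numbers): sample a probability value from the output distribution, then map it back to an action index.

    import numpy as np

    a_dist = np.array([[0.7, 0.3]])                 # network output for one state
    a = np.random.choice(a_dist[0], p=a_dist[0])    # sample a probability value
    a = np.argmax(a_dist == a)                      # recover the index of that action

Sampling an index directly with np.random.choice(len(a_dist[0]), p=a_dist[0]) would also work; the two-step version just mirrors the notebook's code.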

In the next post I will be showing how to use Deep Neural Networks to create agents able to learn more complex relationships with the environment in order to play a more exciting game than pole balancing.
