[Understanding Reinforcement Learning the Hard Way] Tutorial Part 4-7: Overview

In this entry of my RL series I would like to focus on the role that exploration plays in an agent’s behavior.
I will go over a few of the commonly used approaches to exploration which focus on action-selection, and show their comparative strengths and weaknesses, as well as demonstrate how to implement each using Tensorflow.

Why Explore?

This one is easy.
Even when an action is currently believed to give a high reward, exploring may turn up a way to earn an even higher one.

Here, though, the topic seems to be the choice of action-selection method itself rather than the choice of a single action.

Greedy Approach

All reinforcement learning algorithms seek to maximize reward over time.

The problem with a greedy approach is that it almost universally arrives at a suboptimal solution.

Well, it is generally considered a poor choice.
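As a rough sketch of greedy selection, assuming the Q-value estimates are available as a NumPy array. The function name `greedy_action` and the sample values are made up for illustration and are not taken from the tutorial.

```python
import numpy as np

def greedy_action(q_values):
    """Return the index of the action with the highest estimated Q-value."""
    return int(np.argmax(q_values))

# Hypothetical Q-value estimates for 4 actions.
q_values = np.array([0.1, 0.5, 0.2, 0.4])
action = greedy_action(q_values)
```

Because the choice depends only on the current estimates, an action whose value is underestimated early on may never be tried again.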

Random Approach

The opposite approach to greedy selection is to simply always take a random action.

The book I read treated this one as a good approach.
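For contrast, a minimal sketch of purely random selection. The helper name `random_action` and the fixed seed are assumptions for illustration only.

```python
import numpy as np

def random_action(num_actions, rng):
    """Pick an action uniformly at random, ignoring any value estimates."""
    return int(rng.integers(num_actions))

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
actions = [random_action(4, rng) for _ in range(1000)]
```

Every action gets tried, but nothing the agent has learned influences the choice.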

Boltzmann Approach

Is this the first time I have come across this one?
No, I have seen it before.

In exploration, we would ideally like to exploit all the information present in the estimated Q-values produced by our network.

This approach involves choosing an action with weighted probabilities.

If there are 4 actions available to an agent, in e-greedy the 3 actions estimated to be non-optimal are all considered equally, but in Boltzmann exploration they are weighed by their relative value.

In practice we utilize an additional temperature parameter (τ) which is annealed over time.
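A minimal sketch of Boltzmann (softmax) selection with a temperature, assuming the standard form in which each Q-value is divided by τ before the softmax. The function names and sample values here are hypothetical and not taken from the tutorial's code.

```python
import numpy as np

def boltzmann_probs(q_values, tau):
    """Softmax over the Q-values after dividing them by the temperature tau."""
    scaled = np.asarray(q_values, dtype=float) / tau
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

def boltzmann_action(q_values, tau, rng):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, tau)
    return int(rng.choice(len(probs), p=probs))

q_values = [0.1, 0.5, 0.2, 0.4]             # hypothetical estimates for 4 actions
hot = boltzmann_probs(q_values, tau=10.0)   # high temperature: close to uniform
cold = boltzmann_probs(q_values, tau=0.01)  # low temperature: close to greedy
sampled = boltzmann_action(q_values, 1.0, np.random.default_rng(0))
```

A high temperature behaves almost like random selection and a low one almost like greedy selection, which is why annealing τ moves the agent from exploring to exploiting.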

The temperature τ affects the softmax.

Instead what the agent is estimating is a measure of how optimal the agent thinks the action is, not how certain it is about that optimality. While this measure can be a useful proxy, it is not exactly what would best aid exploration. What we really want to understand is the agent’s uncertainty about the value of different actions.

The approach below tells us not only which action looks optimal but also how uncertain that estimate is.

Bayesian Approaches (w/ Dropout)

What if an agent could exploit its own uncertainty about its actions? This is exactly the ability that a class of neural network models referred to as Bayesian Neural Networks (BNNs) provide.

In practice however it is impractical to maintain a distribution over all weights. Instead we can utilize dropout to simulate a probabilistic network.

Dropout is a technique where network activations are randomly set to zero during the training process in order to act as a regularizer.

In order to reduce the noise in the estimate, the dropout keep probability is simply annealed over time from 0.1 to 1.0.
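A rough sketch of dropout-based action selection in plain NumPy rather than Tensorflow. The tiny two-layer network, its random weights, the linear shape of the annealing schedule, and all function names are assumptions for illustration. The source only states that the keep probability is annealed from 0.1 to 1.0.

```python
import numpy as np

def keep_prob_schedule(step, anneal_steps, start=0.1, end=1.0):
    """Linearly anneal the dropout keep probability (linear shape is an assumption)."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)

def dropout_q_values(state, w1, w2, keep_prob, rng):
    """One stochastic forward pass with a random dropout mask on the hidden layer."""
    hidden = np.tanh(state @ w1)
    mask = rng.random(hidden.shape) < keep_prob
    hidden = hidden * mask / keep_prob  # inverted dropout scaling
    return hidden @ w2

def bayesian_action(state, w1, w2, keep_prob, rng):
    """Sample Q-values through dropout, then act greedily on that sample."""
    return int(np.argmax(dropout_q_values(state, w1, w2, keep_prob, rng)))

rng = np.random.default_rng(0)
w1 = rng.normal(size=(3, 16))  # hypothetical weights: 3 state features, 16 hidden units
w2 = rng.normal(size=(16, 4))  # 4 actions
state = np.array([0.2, -0.1, 0.7])
action = bayesian_action(state, w1, w2, keep_prob_schedule(0, 10000), rng)
```

With a low keep probability the sampled Q-values vary a lot between passes, so the argmax changes often. As the keep probability approaches 1.0 the passes become deterministic and the selection becomes greedy.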

Comparison & Full Code

Looking at the comparison, the Bayesian approach clearly comes out ahead.

Code example

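It is surprising how little extra code each method needs...
The Bayesian one, for instance, just takes an argmax.
What this does reconfirm is that the difference this time lies in how the action is selected.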
