Part 7: Action-Selection Strategies for Exploration
In this entry of my RL series I would like to focus on the role that exploration plays in an agent’s behavior.
I will go over a few of the commonly used action-selection approaches to exploration, show their comparative strengths and weaknesses, and demonstrate how to implement each using TensorFlow.
All reinforcement learning algorithms seek to maximize reward over time.
The simplest way to do this is a greedy approach: always choose the action the agent currently estimates to be most valuable. The problem with a greedy approach is that it almost universally arrives at a suboptimal solution, since the agent never tries actions whose value it may have underestimated.
The opposite approach to greedy selection is to simply always take a random action, ignoring the value estimates entirely. A common middle ground between the two is e-greedy, in which the agent takes the greedy action most of the time and a uniformly random action with some small probability ε.
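As a point of reference, here is a minimal sketch of these three rules in plain NumPy (in the full implementation the Q-values come from the TensorFlow network; the array and the ε value here are just illustrative):

```python
import numpy as np

def greedy_action(q_values):
    # Always exploit: pick the action with the highest estimated value.
    return np.argmax(q_values)

def random_action(q_values):
    # Always explore: ignore the estimates and pick uniformly at random.
    return np.random.randint(len(q_values))

def e_greedy_action(q_values, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit.
    if np.random.rand() < epsilon:
        return random_action(q_values)
    return greedy_action(q_values)

q_values = np.array([1.2, 0.3, 0.8, 0.5])  # illustrative estimates for 4 actions
print(greedy_action(q_values), e_greedy_action(q_values))
```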
In exploration, we would ideally like to exploit all the information present in the estimated Q-values produced by our network. Boltzmann exploration does just this: instead of always taking the optimal action or an entirely random one, this approach involves choosing each action with a probability weighted by a softmax over the network's value estimates.
If there are 4 actions available to an agent, in e-greedy the 3 actions estimated to be non-optimal are all considered equally, but in Boltzmann exploration they are weighted by their relative estimated values.
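To make the difference concrete, here is a minimal sketch of the action probabilities each strategy assigns in that four-action case (plain NumPy; the Q-values and ε are made up for illustration):

```python
import numpy as np

q_values = np.array([1.2, 0.3, 0.8, 0.5])  # illustrative estimates; action 0 looks best

# e-greedy: with probability epsilon pick uniformly, otherwise pick the greedy action,
# so every non-optimal action ends up with the same small probability.
epsilon = 0.1
egreedy_probs = np.full(4, epsilon / 4)
egreedy_probs[np.argmax(q_values)] += 1.0 - epsilon

# Boltzmann: each action is weighted by its relative estimated value via a softmax.
boltzmann_probs = np.exp(q_values) / np.sum(np.exp(q_values))

print(egreedy_probs)    # [0.925 0.025 0.025 0.025]
print(boltzmann_probs)  # roughly [0.39 0.16 0.26 0.19]
```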
In practice we utilize an additional temperature parameter (τ), which is annealed over time. It controls the spread of the softmax distribution, so that actions are weighted more evenly early in training and the distribution concentrates on the highest-value actions as training progresses.
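One way this can look, assuming a simple linear schedule for τ (the schedule and values below are illustrative, not those of the full implementation). In a TensorFlow graph the same operation can be expressed as something like tf.nn.softmax(q_out / tau), with τ fed in at runtime:

```python
import numpy as np

def boltzmann_action(q_values, temperature):
    # Softmax over Q-values: P(a) = exp(Q(a)/tau) / sum_a' exp(Q(a')/tau).
    scaled = q_values / temperature
    scaled -= np.max(scaled)  # subtract the max for numerical stability
    probs = np.exp(scaled) / np.sum(np.exp(scaled))
    return np.random.choice(len(q_values), p=probs)

# Illustrative linear anneal of tau from 1.0 down to 0.1 over training.
start_tau, end_tau, anneal_steps = 1.0, 0.1, 10000

def temperature(step):
    frac = min(step / anneal_steps, 1.0)
    return start_tau + frac * (end_tau - start_tau)

q_values = np.array([1.2, 0.3, 0.8, 0.5])
for step in [0, 5000, 10000]:
    tau = temperature(step)
    print(step, tau, boltzmann_action(q_values, tau))
```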
The underlying assumption is that the softmax over the Q-values gives a measure of the agent's confidence in each action, but that is not quite what it provides. Instead, what the agent is estimating is a measure of how optimal it thinks the action is, not how certain it is about that optimality. While this measure can be a useful proxy, it is not exactly what would best aid exploration. What we really want to understand is the agent's uncertainty about the value of different actions.
Bayesian Approaches (w/ Dropout)
What if an agent could exploit its own uncertainty about its actions? This is exactly the ability that a class of neural network models referred to as Bayesian Neural Networks (BNNs) provide.
Unlike an ordinary network with a single fixed set of weights, a BNN maintains a probability distribution over its weights, and sampling from that distribution yields samples from a distribution over Q-values that reflects the agent's uncertainty. In practice, however, maintaining a distribution over all of the weights is impractical. Instead we can utilize dropout to simulate a probabilistic network.
Dropout is a technique in which network activations are randomly set to zero during training in order to act as a regularizer. By keeping dropout active when the agent selects actions, rather than only during training, each forward pass produces a slightly different set of Q-value estimates, and repeated passes behave like samples from the network's uncertainty over those values.
In order to reduce the noise in these estimates as training progresses, the dropout keep probability is simply annealed over time from 0.1 to 1.0.
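A rough sketch of how this might be wired up, using a toy NumPy stand-in for the Q-network (the weights, layer sizes, and the linear anneal schedule below are all illustrative, not taken from the full implementation): the dropout mask stays active at action-selection time, and the agent acts greedily with respect to that single noisy sample of Q-values.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 16))   # hypothetical weights; in practice these are learned
W2 = rng.normal(size=(16, 4))

def sampled_q_values(state, keep_prob):
    hidden = np.maximum(state @ W1, 0.0)          # ReLU hidden layer
    mask = rng.random(hidden.shape) < keep_prob   # dropout mask, kept on at selection time
    hidden = hidden * mask / keep_prob            # inverted dropout scaling
    return hidden @ W2

def dropout_action(state, keep_prob):
    # Act greedily with respect to a single stochastic (dropout) forward pass.
    return np.argmax(sampled_q_values(state, keep_prob))

# Anneal keep_prob from 0.1 (very noisy estimates, lots of exploration)
# toward 1.0 (deterministic network, pure exploitation) over training.
def keep_prob(step, anneal_steps=10000):
    return min(0.1 + 0.9 * step / anneal_steps, 1.0)

state = rng.normal(size=(8,))
for step in [0, 5000, 10000]:
    print(step, keep_prob(step), dropout_action(state, keep_prob(step)))
```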
Comparison & Full Code