Part 8: Asynchronous Actor-Critic Agents (A3C)
In this article I want to provide a tutorial on implementing the Asynchronous Advantage Actor-Critic (A3C) algorithm in Tensorflow.
So what is A3C? The A3C algorithm was released by Google’s DeepMind group earlier this year, and it made a splash by… essentially obsoleting DQN.
It was faster, simpler, more robust, and able to achieve much better scores on the standard battery of Deep RL tasks. On top of all that it could work in continuous as well as discrete action spaces. Given this, it has become the go-to Deep RL algorithm for new challenging problems with complex state and action spaces.
The 3 As of A3C
Unlike DQN, where a single agent represented by a single neural network interacts with a single environment, A3C utilizes multiple incarnations of the above in order to learn more efficiently.
The reason this works better than having a single agent (beyond the speedup of getting more work done), is that the experience of each agent is independent of the experience of the others.
In the case of A3C, our network will estimate both a value function V(s) (how good a certain state is to be in) and a policy π(s) (a set of action probability outputs).
Critically, the agent uses the value estimate (the critic) to update the policy (the actor) more intelligently than traditional policy gradient methods.
The insight of using advantage estimates rather than just discounted returns is to allow the agent to determine not just how good its actions were, but how much better they turned out to be than expected. Intuitively, this allows the algorithm to focus on where the network’s predictions were lacking.
Advantage: A = Q(s,a) - V(s)
Since we won’t be determining the Q values directly in A3C, we can use the discounted returns (R) as an estimate of Q(s,a) to allow us to generate an estimate of the advantage.
Discounted Reward: R = γ(r)
Advantage Estimate: A = R - V(s)
In this tutorial, we will go even further, and utilize a slightly different version of advantage estimation with lower variance referred to as Generalized Advantage Estimation.
- AC_Network — ネットワークのclass
- Worker — AC_Networkのコピーのclassや環境のclassを含み、環境とのやりとりをし、global networkを更新する
The A3C algorithm begins by constructing the global network. This network will consist of convolutional layers to process spatial dependencies, followed by an LSTM layer to process temporal dependencies, and finally, value and policy output layers.
process spatial dependenciesと
to process temporal dependenciesの違いがイマイチわかんねぇ
Next, a set of worker agents, each with their own network and environment are created. Each of these workers are run on a separate processor thread, so there should be no more workers than there are threads on your CPU.
~ From here we go asynchronous ~
Each worker begins by setting its network parameters to those of the global network. We can do this by constructing a Tensorflow op which sets each variable in the local worker network to the equivalent variable value in the global network.
Each worker then interacts with its own copy of the environment and collects experience. Each keeps a list of experience tuples (observation, action, reward, done, value) that is constantly added to from interactions with the environment.
Once the worker’s experience history is large enough, we use it to determine discounted return and advantage, and use those to calculate value and policy losses.
We also calculate an entropy (H) of the policy. This corresponds to the spread of action probabilities.
Value Loss: L = Σ(R - V(s))²
Policy Loss: L = -log(π(s)) * A(s) - β*H(π)