In the above figure, we illustrate BGN by combining it with a standard actor-critic (A2C) agent that takes in histories (past actions and observations). Assuming that the actor and critic do not share parameters, we add two BGNs for the separate feature extractors. Black components represent a standard dual-network A2C agent, while blue components indicate the additional network heads for reconstructing the belief. FC stands for a fully-connected layer (a linear transformation). We assume that observations and actions are discrete, and therefore use softmax activation functions for the policy distribution (Distr.) and the reconstructed beliefs. In continuous environments, the softmax can be replaced with other families of distributions. The two added branches help the agent learn useful features from the history for the given task. During deployment, they can be removed, leaving only the actor, which selects actions using the history alone as input.
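To make the architecture concrete, below is a minimal sketch of the actor with an added belief-reconstruction head (the critic is analogous). It assumes PyTorch, a recurrent feature extractor over the history, and discrete actions and states; all names and dimensions are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ActorWithBeliefHead(nn.Module):
    def __init__(self, history_dim, hidden_dim, num_actions, num_states):
        super().__init__()
        # Feature extractor over the history of past actions and observations.
        self.features = nn.GRU(history_dim, hidden_dim, batch_first=True)
        # Policy head: FC layer + softmax over discrete actions.
        self.policy_head = nn.Linear(hidden_dim, num_actions)
        # Added BGN branch: FC layer + softmax reconstructing the belief
        # over discrete states; used only during simulated training.
        self.belief_head = nn.Linear(hidden_dim, num_states)

    def forward(self, history):
        _, h = self.features(history)   # summarize the history
        h = h.squeeze(0)
        policy = torch.softmax(self.policy_head(h), dim=-1)
        belief_hat = torch.softmax(self.belief_head(h), dim=-1)
        return policy, belief_hat

# At deployment the belief head is simply ignored:
# policy, _ = actor(history); action = torch.multinomial(policy, 1)
```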
By reconstructing the belief during simulated training, the agent learns to overcome uncertainty in its environment. Most notably, the BGN does not require beliefs during execution, making the trained policies amenable to physical systems. When transferred to a real-world robot, these policies successfully completed all of our proposed manipulation tasks without any fine-tuning or other adjustments. Our work focused on discrete-state environments, but future work could extend our method to continuous-state tasks. For example, the belief update could be approximated with methods such as particle filters or Kalman filters.
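As a rough illustration of the continuous-state extension mentioned above, the following is a minimal sketch of a particle-filter belief update. The `transition_fn` and `observation_likelihood` arguments are hypothetical environment models, and the sketch is not part of our method.

```python
import numpy as np

def particle_filter_update(particles, weights, action, observation,
                           transition_fn, observation_likelihood):
    """One approximate belief update: propagate particles through the
    dynamics, reweight by the observation likelihood, and resample."""
    # Propagate each particle through the (possibly stochastic) dynamics.
    particles = np.array([transition_fn(p, action) for p in particles])
    # Reweight by how well each particle explains the new observation.
    weights = weights * np.array(
        [observation_likelihood(observation, p) for p in particles])
    weights /= weights.sum()
    # Resample to avoid weight degeneracy.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```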