Model Based Bayesian RL

By Akwasi A. Obeng and Shesh Mali

What is Model Based RL?

Imagine finding yourself in a new environment with no GPS or other electronic gadget for navigation. What do you do? Well, you could sit there and hope for some miraculous event, or you could resort to what humans have been doing since time immemorial: explore your environment and build a mental picture of where things are, or better yet, create a map of your surroundings. With the map in hand, planning becomes easy, as you can construct paths that take you from A to B.

Well, we would like robots to have the same capability. Given some objective, it would be nice if a robot could create some sort of map and then plan according to the criteria specified. This is what is referred to as Model-Based RL. As the name suggests, a model of the environment (the map) is created as the robot interacts with its environment, and the robot (agent) then plans by searching for behaviors (a policy) that enable it to achieve the task assigned (maximizing reward). This is shown in the diagram below.

figure 1 (by David Silver)

Given the MDP <S, A, P, R>, where S denotes the states, A the actions, P (or T) the transition model, and R the reward, the objective is to obtain an optimal value or policy function by first learning the model P, instead of estimating the value/policy function directly as in model-free approaches. Once the model P is known, we can use the policy iteration or value iteration algorithms to solve the MDP.
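Once P and R are in hand, solving the MDP is straightforward. Below is a minimal value iteration sketch over a made-up 3-state, 2-action MDP (all numbers are purely illustrative):

```python
import numpy as np

# Value iteration once the model P and reward R are known.
# The 3-state, 2-action MDP below is invented for illustration.
gamma = 0.9
P = np.array([  # P[a, s, s']: transition model for each action
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],
])
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]])  # R[s, a]

V = np.zeros(3)
for _ in range(500):
    # Bellman optimality backup over all state-action pairs at once:
    # Q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * V[s']
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)  # greedy policy w.r.t. the converged values
```

Policy iteration would work equally well here; value iteration is shown only because it is the shorter of the two.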

Now, what if you are lucky enough to get hold of an outdated map of the unknown region? That is so much better than toiling to and fro just to create a map from scratch. Do we discard the map because it is not good enough, or do we use the map despite its uncertainties and, if possible, improve it? Of course the latter is the more reasonable thing to do. It is far easier to add a missing building or boundary to the map than to create the whole map all over again from scratch.

The model-based RL discussed in the earlier paragraphs assumes that the robot has a perfect map, which is almost never the case, since the model is generated from a finite amount of collected experience. How can we characterize these uncertainties? Bayesian RL to the rescue.

What is Model Based Bayesian RL?

In Model-Based Bayesian RL, we account for the uncertainty in the model. Given some sampled points as shown in figure 2, more than one model may fit these points, as shown in figure 3. The uncertainty arising from the different possible transition models is shown in figure 4.

figure 2 (By Pascal Poupart)

figure 3(By Pascal Poupart)

figure 4(By Pascal Poupart)

Different transition models may yield different results for the same input. How do we characterize the uncertainty arising from these different transition models?

Let's restrict ourselves to only four transition models, T1, T2, T3, and T4. Given state x and action a, we may have T1(x'|x, a) = 0.2, T2(x'|x, a) = 0.3, T3(x'|x, a) = 0.1, and T4(x'|x, a) = 0.2 for the next state x'. Which of these transition models is the right one? Intuitively, we take the value agreed upon by most of the transition models. The idea is to represent the transition probability itself as a random variable and consider the probabilities of its possible values. In this case, P[T(x'|x, a) = 0.2] = 0.5, P[T(x'|x, a) = 0.3] = 0.25, and P[T(x'|x, a) = 0.1] = 0.25. One may then take the value of the transition model to be 0.2, that is, T(x'|x, a) = 0.2, since it has the highest probability.
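This voting idea can be sketched in a few lines; the four transition values below are the hypothetical T1..T4 evaluations from the example above:

```python
from collections import Counter

# Treat the transition value T(x'|x, a) as a random variable: each
# sampled model T1..T4 votes for one value, and the relative
# frequencies give the probabilities discussed above.
model_values = [0.2, 0.3, 0.1, 0.2]  # T1..T4 evaluated at (x', x, a)

counts = Counter(model_values)
n = len(model_values)
prob = {v: c / n for v, c in counts.items()}
# prob == {0.2: 0.5, 0.3: 0.25, 0.1: 0.25}

map_value = max(prob, key=prob.get)  # the most probable value, 0.2
```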

Bayesian RL in POMDP Framework

Bayesian RL can be thought of as a Partially Observable MDP (POMDP). Recall that in a POMDP the states are not fully observable, which makes the transition model T uncertain, whereas in Model-Based Bayesian RL the states may be fully known but the model is uncertain. In short, the problem statement of a POMDP can be paraphrased as "I don't know my location on the map, and therefore the map is not that useful for navigation," while Model-Based Bayesian RL says "I may know my exact location, but the map is not that accurate."

In the POMDP setting, the uncertainty about the states is accounted for by using belief states: the probabilities of being in each of the states are grouped together into a vector.

Therefore we have,

B = <b(s1), b(s2), b(s3), b(s4), ..., b(sn)> ---> belief state

where b(s1) is the probability of being in state s1.

By capturing this uncertainty in the state, the problem is thus reduced to an MDP <B, A, P, R>, where

B - belief states, A - actions, P - transition model, R - reward.
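To make the belief state concrete, here is a minimal Bayes-filter update of the belief vector B: predict through the transition model, then reweight by the likelihood of the observation. The two-state transition and observation tables are invented for illustration.

```python
import numpy as np

# Illustrative two-state POMDP tables (made up for this sketch)
T = np.array([[0.7, 0.3],
              [0.2, 0.8]])          # T[s, s'] for the chosen action
O = np.array([[0.9, 0.1],
              [0.4, 0.6]])          # O[s', o]: observation likelihoods

def belief_update(b, obs):
    # Predict: push the belief through the transition model,
    # then correct: weight by the likelihood of the observation.
    predicted = b @ T                # sum_s b(s) * T(s, s')
    unnorm = predicted * O[:, obs]
    return unnorm / unnorm.sum()    # renormalize so entries sum to 1

b = np.array([0.5, 0.5])            # start maximally uncertain
b = belief_update(b, obs=0)         # belief after seeing observation 0
```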

Using this idea, Model-Based Bayesian RL can be reduced to a continuous POMDP by considering hybrid states of the form <s, q>, where
s -> state (known), q -> parameter of the transition model (unknown).
Since the state <s, q> is not fully known, the analysis used for POMDPs can be applied here as well. The parameter q of the transition model could come from any sort of distribution, say a Binomial or a Poisson distribution. Because q is usually continuous, the state <s, q> is continuous for most distributions, and the belief state is thus usually a continuous distribution; hence the continuous POMDP.
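As a simple sketch of maintaining a belief over the unknown parameter q, consider a transition with two possible outcomes, where q is the probability of the first. A Beta prior over q is the standard conjugate choice here; the observation counts below are made up:

```python
import numpy as np

# Belief over the unknown transition parameter q, kept as a Beta
# distribution. The counts are invented for illustration.
alpha, beta = 1.0, 1.0            # Beta(1,1): uniform prior over q

# Suppose the agent observes 7 transitions to s' and 3 elsewhere;
# conjugacy makes the posterior update a simple count increment.
successes, failures = 7, 3
alpha += successes
beta += failures

q_mean = alpha / (alpha + beta)   # posterior mean estimate of q
# Samples from the posterior (useful for sampling-based planners)
q_samples = np.random.default_rng(0).beta(alpha, beta, size=5)
```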

The Bayesian/POMDP framework accounts for many sources of uncertainty and is the more realistic approach to modelling the real world. However, this comes at the expense of computational cost. Below are some algorithms and variations of Bayesian RL.


The approximate algorithms available for solving an MDP using Bayesian inference are mainly classified as shown in the table below. [1]

Now, we will discuss the Best of Sampled Sets (BOSS) and Bayesian Exploration Bonus (BEB) approximation algorithms for estimating the value function.

BOSS: Best of Sampled Sets

BOSS was one of the first approaches to assume a Bayesian learning framework. Whenever the number of observed transitions from some state-action pair reaches a predefined threshold B (the known-ness parameter), BOSS samples K models from the posterior and merges them into an optimistic hyper-model for decision making: the hyper-model has one set of actions per sampled model, and the transition function for each such action is taken directly from the corresponding sampled model. BOSS then solves the hyper-model and selects the best action according to it. This resampling is triggered until each state-action pair has been observed B times, after which the policy of the last hyper-model is retained. [2]
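A rough sketch of the BOSS hyper-model construction, with made-up Dirichlet counts standing in for the posterior (the state, action, and sample sizes are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, K = 3, 2, 0.9, 4

# Dirichlet counts per (s, a): the posterior over each transition row.
# Real counts would come from experience; these are invented.
counts = rng.integers(1, 10, size=(n_states, n_actions, n_states)).astype(float)
R = rng.random((n_states, n_actions))

def sample_model():
    # One full transition model drawn from the Dirichlet posterior
    return np.array([[rng.dirichlet(counts[s, a])
                      for a in range(n_actions)] for s in range(n_states)])

# Merge K sampled models into one optimistic hyper-model:
# K * n_actions actions, each backed by one sampled model.
models = [sample_model() for _ in range(K)]
P_hyper = np.concatenate(models, axis=1)   # shape (S, K*A, S)
R_hyper = np.tile(R, (1, K))               # shape (S, K*A)

# Solve the hyper-model with value iteration
V = np.zeros(n_states)
for _ in range(500):
    Q = R_hyper + gamma * np.einsum("sat,t->sa", P_hyper, V)
    V = Q.max(axis=1)
hyper_action = Q.argmax(axis=1)
real_action = hyper_action % n_actions  # map back to an original action
```

The `% n_actions` mapping works because the concatenation orders the hyper-model's actions model by model, so every block of `n_actions` columns corresponds to one sampled model.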

BEB: Bayesian Exploration Bonus

Bayesian Exploration Bonus (BEB) guarantees bounded error within a small number of samples. The algorithm simply chooses actions greedily with respect to a value function augmented with an exploration bonus. [3]

The exploration bonus in BEB decays significantly faster, i.e. as 1/n, where n is the number of visits to a state-action pair. It is possible to use this faster decay because BEB aims to achieve the optimal Bayesian solution rather than optimality in the underlying MDP.
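The bonus itself, as described in the cited paper, is proportional to 1/(1 + n) for a state-action pair visited n times; `beta` below is an illustrative constant:

```python
# BEB-style exploration bonus: beta / (1 + n), where n counts visits
# to the state-action pair. beta is an illustrative constant here.
beta = 2.0

def beb_bonus(n_visits):
    # Added to the reward before acting greedily. The 1/n decay is
    # what distinguishes BEB from the slower 1/sqrt(n)-style bonuses
    # of PAC-MDP methods such as MBIE-EB.
    return beta / (1.0 + n_visits)

bonuses = [beb_bonus(n) for n in range(5)]
# shrinks quickly: roughly 2.0, 1.0, 0.67, 0.5, 0.4
```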


Bayesian Reinforcement Learning techniques can be applied in various domains where the action selection criteria depend on prior information about the problem. Some of the important general application domains are listed below:

  • It is applied to make better action predictions for stocks, in order to decide whether to buy, sell, or hold a stock to maximize profit. Examples: Predictive Modeling, Portfolio Optimization, and Market Risk, Contagion and Credit Risk. [4]

  • It is applied in the healthcare sector to diagnose and suggest treatment based on previous experience, i.e. Automated Medical Diagnosis and Dynamic Treatment Regimes. It could also be applied and extended to Health Resource Scheduling and Allocation, Optimal Process Control, Drug Discovery and Development, Health Management, etc. A detailed application-specific categorization can be seen below. [5]

Fig. The outline of application domains of RL in healthcare( source : [5])

  • It is applied in the marketing sector for studying and targeting the best possible audience or product to maximize profit. It has wide application in this sector, since marketing depends on many aspects of human choice. [6]

  • It is applied in robotics as well as gaming, in multiple ways, to take into account the uncertainty of human behavior. There is a wide spectrum of applications where the BRL framework is used to achieve significant performance. Today, we expect autonomous driving to take to the road as quickly as possible, and using Bayesian inference to incorporate prior information into the learning process seems unavoidable.


Although Bayesian RL provides a good framework for selecting optimal actions given a prior, it is not always easy to model or represent the prior information. Hence, there is room for research on how to generate the best prior from the given data. Reinforcement learning already has its own limitations, which need to be overcome first: deploying a reinforcement learning framework on real-world problems is much more difficult than on game applications, where simulators are available. [7][8] BRL copes better with model misspecification by maintaining a set of promising models to infer from using prior knowledge, but there is still a limit to what it can handle. We can't just throw in any random models and expect it to learn and converge; we need good models and good prior information before expecting BRL to be the solution to our problem.


[1] Bayesian Reinforcement Learning: A Survey : Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar

[2] A Bayesian Sampling Approach to Exploration in Reinforcement Learning : John Asmuth, Lihong Li, Michael L. Littman, Ali Nouri, David Wingate

[3] Near-Bayesian Exploration in Polynomial Time : J. Zico Kolter, Andrew Y. Ng

[4] Deep Probabilistic Programming for Financial Modeling : Matthew F. Dixon

[5] Reinforcement Learning in Healthcare: A Survey : Chao Yu, Jiming Liu, Shamim Nemati

[6] Reinforcement Learning in Marketing : Deepthi A R

[7] Challenges of Real-World Reinforcement Learning: Gabriel Dulac-Arnold, Daniel Mankowitz, Todd Hester

[8] Challenges of Real-World Reinforcement Learning : Adrian Colyer