Can you re-explain the concepts of ‘online’ RL and ‘offline’ RL?
Online RL is what we're going to focus on in this practical, and what we focused on in the lecture yesterday. This is where you are learning at the same time as you explore the environment.
Offline RL is the scenario where you cannot explore. You might just have a dataset of states, actions, and rewards that someone else has generated and given to you.
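To make the distinction concrete, here is a minimal sketch, assuming a made-up 5-state corridor (the `step` function, the constants, and the random logging policy are all hypothetical and not part of the practical). The online learner chooses its own actions while it learns; the offline learner only ever sees a fixed list of logged transitions.

```python
import random

N_STATES, GOAL = 5, 4      # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (-1, +1)         # move left / move right

def step(state, action):
    """Toy dynamics: reward 1 on reaching the goal, which ends the episode."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def new_q():
    return {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def q_update(q, s, a, r, s2, done, alpha=0.5, gamma=0.9):
    """One tabular Q-learning update on a single transition."""
    target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (target - q[(s, a)])

def online_q_learning(episodes=200, epsilon=0.2, seed=0):
    """Online: the agent explores the environment while it learns."""
    rng = random.Random(seed)
    q = new_q()
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:  # greedy action, ties broken at random
                best = max(q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if q[(s, b)] == best])
            s2, r, done = step(s, a)
            q_update(q, s, a, r, s2, done)
            s = s2
    return q

def collect_dataset(n=500, seed=1):
    """Stand-in for data logged by someone else (uniformly random actions)."""
    rng = random.Random(seed)
    data, s = [], 0
    for _ in range(n):
        a = rng.choice(ACTIONS)
        s2, r, done = step(s, a)
        data.append((s, a, r, s2, done))
        s = 0 if done else s2
    return data

def offline_q_learning(dataset, sweeps=50):
    """Offline: no calls to step(); we only replay the fixed dataset."""
    q = new_q()
    for _ in range(sweeps):
        for s, a, r, s2, done in dataset:
            q_update(q, s, a, r, s2, done)
    return q

q_online = online_q_learning()
q_offline = offline_q_learning(collect_dataset())
```

Both learners use the same update rule here; the only difference is where the transitions come from. In this tiny example the logged data covers every state and action, which is exactly what you cannot count on in realistic offline settings.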

Is there a systematic way to define the rewards?
I'm sorry, can you re-explain what "observations" are? Why is the value [9, 10, 3]?

Observations are the same as the states we talked about yesterday.

Hi Rian! That's not the value of the observation, it's the shape of it.

This is what we mean by "spec" or specification. It's the shape, dtype, and possibly minimum and maximum values.
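As a rough illustration of what a spec carries, here is a sketch using a hypothetical `ArraySpec` class. It is a stand-in written for this example, not the actual spec class used by the environment in the practical, and the bounds of 0 and 1 are an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class ArraySpec:
    """Hypothetical spec: describes an observation without holding its values."""
    shape: Tuple[int, ...]
    dtype: type
    minimum: Optional[float] = None
    maximum: Optional[float] = None

def shape_of(nested):
    """Shape of a rectangular, non-empty nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)

def flatten(nested):
    """Yield every scalar inside a nested list."""
    if isinstance(nested, list):
        for item in nested:
            yield from flatten(item)
    else:
        yield nested

def conforms(spec, value):
    """Check a concrete observation against the spec."""
    if shape_of(value) != spec.shape:
        return False
    for x in flatten(value):
        if not isinstance(x, spec.dtype):
            return False
        if spec.minimum is not None and x < spec.minimum:
            return False
        if spec.maximum is not None and x > spec.maximum:
            return False
    return True

# The spec says "9 x 10 x 3 floats"; the observation is the actual numbers.
spec = ArraySpec(shape=(9, 10, 3), dtype=float, minimum=0.0, maximum=1.0)
observation = [[[0.0] * 3 for _ in range(10)] for _ in range(9)]
print(shape_of(observation))  # (9, 10, 3)
```

So [9, 10, 3] answers "what does an observation look like?", not "what is in it?".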

Is the 3D environment more difficult to operate than the 2D environment?

Each episode is solvable, right? I mean, the walls will never block the path from S to G?

Adhi: in principle, the tabular agents work the same in 2 or 3D, but exploration (finding the reward in the first place) can be much more difficult in 3D.

Note that here the environment is not 3D: the 3 layers of 9x10 grids are one way to encode where the goal is, where the agent is, etc.
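A minimal sketch of such a layered encoding, assuming one layer each for walls, agent, and goal (the channel order and the `encode` helper are assumptions made for illustration; the practical's environment may order or define its layers differently):

```python
ROWS, COLS = 9, 10
WALL, AGENT, GOAL = 0, 1, 2   # assumed channel order

def encode(walls, agent, goal):
    """Build a ROWS x COLS x 3 grid; each cell holds three 0/1 flags."""
    obs = [[[0.0, 0.0, 0.0] for _ in range(COLS)] for _ in range(ROWS)]
    for row, col in walls:
        obs[row][col][WALL] = 1.0
    obs[agent[0]][agent[1]][AGENT] = 1.0
    obs[goal[0]][goal[1]][GOAL] = 1.0
    return obs

obs = encode(walls=[(0, 0), (0, 1)], agent=(2, 3), goal=(7, 8))
print(obs[2][3])  # [0.0, 1.0, 0.0] -> only the agent layer is set here
```

The world is still a flat 2D grid; the third dimension only says which kind of thing occupies each cell.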

observations = states because this is a fully observable MDP?

Chris: Yes. In fact, to simplify this lab session, there is no procedural generation of episodes in this case.

Can we think of it as a control system with feedback? I.e., take an action, receive feedback, and based on the feedback get a reward or choose another action.

Said: Exactly, in this case we use them interchangeably because the MDP is fully observed.

Georgios: Yes, if you come from a control theoretical background, this may be a more intuitive way to think of this :)
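For those who like the control view, the loop can be sketched as below. `LineEnv`, `toward_target`, and the `max_steps` cap are hypothetical names invented for this example; the policy is hand-written rather than learned, purely to show the act / observe / reward cycle.

```python
class LineEnv:
    """Toy 'plant': an integer position that should reach a target."""

    def __init__(self, start=0, target=3):
        self.start, self.target = start, target
        self.position = start

    def reset(self):
        self.position = self.start
        return self.position

    def step(self, action):
        self.position += action
        reward = -abs(self.target - self.position)   # feedback signal
        done = self.position == self.target
        return self.position, reward, done

def toward_target(observation, target=3):
    """Hand-written 'controller': returns -1, 0, or +1 toward the target."""
    return (target > observation) - (target < observation)

def run_episode(env, policy, max_steps=20):
    """Closed loop: observe, act, receive reward, repeat."""
    observation = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if done:          # stop when the environment ends the episode
            break
    return total_reward, steps

print(run_episode(LineEnv(), toward_target))  # (-3.0, 3)
```

The main difference from classical control is that in RL the policy is not designed by hand; it is adjusted from the rewards the loop produces.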

What is the stopping criterion for this loop?

Side question: following up on my previous question, in more challenging environments, e.g. self-driving cars, where there are tons of possible rewards depending on the state and action, how are the rewards defined?
If we're working with a physical robot and want to use RL, can we move the training phase into a virtual environment? I think it's a bad idea to train in the real world: learning to move the robot takes a lot of time, and if the robot hits a wall or collides with something, it will probably break hardware parts.

Do current self-driving cars use a reinforcement-learning-based model? I think they are just a combination of supervised-learning-based models.

@Riyad In robotics we always simulate first before applying anything on the actual robot.

Riyad: Yes indeed! This is a big question in RL, often called sim2real: if we learn to behave optimally in a simulator, can we transfer this policy to the real world? In practice this is very challenging and an interesting open research question.

Apologies if this is a bit of a random/unrelated question, but I was just curious whether some areas of RL research intersect with quantum computing research. The fact that the system has a probabilistic state and you want to evolve it in time to future states kind of reminds me of quantum wavefunctions.

What if we generate an environment that is similar to the real world with something like a GAN?

Is there a methodology (in loose terms) when designing/building a Reinforcement learning system?

Georgios: Not sure if this answers your question, but you can perhaps think of all these value-based methods as following one such "methodology" whereby they are designed to propagate the observed rewards to the state-action pairs that produced them.

There are many ways to do this, and SARSA and Q-learning are a couple that we are seeing here.
Later Matt will present a different such "methodology" that is focussed on a policy rather than the values.

And finally the Actor-critic family of methods combines both of these ideas. Sadly we will not be covering any today but see the Acme implementations of Impala/DDPG/D4PG if you want to explore actor-critic algorithms. :)
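To make the two value-based updates mentioned above concrete, here is a sketch of the standard tabular SARSA and Q-learning rules. The table layout (`q[state][action]`), the step size, and the discount are assumptions for illustration, and terminal-state handling is left out to keep it short.

```python
def sarsa_update(q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a2 the agent actually takes next."""
    td_error = r + gamma * q[s2][a2] - q[s][a]
    q[s][a] += alpha * td_error
    return td_error

def q_learning_update(q, s, a, r, s2, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best action available in s2."""
    td_error = r + gamma * max(q[s2]) - q[s][a]
    q[s][a] += alpha * td_error
    return td_error

# Two states, two actions. In state 1, action 1 looks better than action 0.
q_sarsa = [[0.0, 0.0], [1.0, 2.0]]
q_qlearn = [[0.0, 0.0], [1.0, 2.0]]

# Same transition for both: (s=0, a=0, r=1, s2=1); SARSA's next action is a2=0.
print(sarsa_update(q_sarsa, 0, 0, 1.0, 1, 0))      # TD error of about 1.9
print(q_learning_update(q_qlearn, 0, 0, 1.0, 1))   # TD error of about 2.8
```

In both cases the observed reward is pushed back onto the state-action pair that produced it; the two rules only disagree about which next-state value to trust.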

Is there any way to expand the observation/action space of a network that was trained with a smaller observation/action space? Maybe something similar to continual learning.

Where can I find how behaviour_policy() is implemented? Is it different from epsilon-greedy?

Chris: This is a very interesting research question. Indeed one can imagine fine tuning a learned model on a similar task with an additional sensory signal. However, being able to incorporate new sensory signals while not catastrophically forgetting what has been learned already is very delicate.

Can experience replay cause overfitting to some of the past episodes? Some episodes might be sampled more often than others.

What is 'prioritized experience replay', and how does it prioritize the experience?

Wawan: Very good catch! Indeed, this is one of the concerns when using replay. While it can be more efficient (less interaction with the environment), it also increases your bias towards your previous experience.

Riyad: There are many ways you can prioritize replay, but one possible example is to use the TD error, effectively sampling more frequently the experiences that "surprise" your value estimator.
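A minimal sketch of the TD-error idea, assuming a hypothetical `PrioritizedBuffer` class written for this example. It samples in proportion to |TD error| and deliberately omits pieces of the full prioritized replay method, such as the priority exponent and importance-sampling corrections.

```python
import random

class PrioritizedBuffer:
    """Hypothetical buffer: sampling probability is proportional to priority."""

    def __init__(self, seed=0, eps=1e-3):
        self._items = []
        self._priorities = []
        self._rng = random.Random(seed)
        self._eps = eps   # keeps zero-error transitions sampleable

    def add(self, transition, td_error):
        self._items.append(transition)
        self._priorities.append(abs(td_error) + self._eps)

    def sample(self, k):
        """Return k (index, transition) pairs, drawn with replacement."""
        indices = self._rng.choices(
            range(len(self._items)), weights=self._priorities, k=k)
        return [(i, self._items[i]) for i in indices]

    def update(self, index, td_error):
        """After learning from a transition, refresh its priority."""
        self._priorities[index] = abs(td_error) + self._eps

buffer = PrioritizedBuffer()
buffer.add("boring transition", td_error=0.0)
buffer.add("surprising transition", td_error=5.0)
batch = buffer.sample(1000)
n_surprising = sum(1 for _, item in batch if item == "surprising transition")
```

Here almost every draw returns the surprising transition. Once the value estimator has learned from it, its TD error shrinks, `update` lowers its priority, and it stops dominating the batches.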

@Bobak: is there any solution to the problem?

Everyone: Please feel free to ask for clarification if my answers don't clarify things!

Wawan: The approach we use in Acme is to use rate limitation (facilitated by the Reverb replay system). This way we at least fix how many times the same experience can be sampled (on average); this does not "solve" the problem, but it at least allows us to increase/reduce this quantity to balance sample efficiency vs overfitting.
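The idea behind such a limit can be sketched in a few lines. `SampleToInsertLimiter` below is a hypothetical toy class, not the Reverb API, and it only gates sampling, whereas a real rate limiter may also hold back inserts.

```python
class SampleToInsertLimiter:
    """Toy limiter: allow at most samples_per_insert samples per insert."""

    def __init__(self, samples_per_insert, min_size_to_sample=1):
        self.samples_per_insert = samples_per_insert
        self.min_size_to_sample = min_size_to_sample
        self.inserts = 0
        self.samples = 0

    def record_insert(self):
        self.inserts += 1

    def try_sample(self):
        """Return True and count the sample only if the budget allows it."""
        if self.inserts < self.min_size_to_sample:
            return False
        if self.samples + 1 > self.inserts * self.samples_per_insert:
            return False
        self.samples += 1
        return True

limiter = SampleToInsertLimiter(samples_per_insert=2)
for _ in range(3):
    limiter.record_insert()          # 3 new experiences arrive
results = [limiter.try_sample() for _ in range(8)]
print(results)  # six True values, then two False
```

Raising `samples_per_insert` reuses each experience more (better sample efficiency, more risk of overfitting); lowering it does the opposite.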

Mat, may I know why you set last_loss = 0.0? Thanks.

Sorry, my question is off topic. Does the Acme framework support environment simulation with Unity ML-Agents?

Riyad: By construction the Acme framework does not make any assumptions about your environment as long as it can interface with it via the dm_env.Environment API.

maybe you can zoom out the browser

Does it help to use a pretrained CNN model for reinforcement learning?
Sorry, I mean a CNN pretrained on an image classification task.
Can we transfer (both intrinsic and extrinsic) rewards from a similar environment?

200 I believe.

