Discussion

From Matt Hoffman to Everyone: (9:30 PM)

  • 
https://colab.research.google.com/github/feryal/rl_mlss_2020/blob/master/RL_Tutorial_MLSS_2020.ipynb


From SAID AL FARABY to Everyone: (9:30 PM)

  • 
https://colab.research.google.com/github/feryal/rl_mlss_2020/blob/master/RL_Tutorial_MLSS_2020.ipynb
thanks matt


From Georgios to Everyone: (9:30 PM)

  • 
Thank you, much obliged.
perfect


From Bobak Shahriari to Everyone: (9:32 PM)

  • 
Hi everyone! :)


From SAID AL FARABY to Everyone: (9:32 PM)

  • 
Thank you Bobak


From Paola to Everyone: (9:33 PM)

  • 
Thanks


From Me to Everyone: (9:36 PM)

  • 
Can you re-explain the concept of ‘online’ RL and ‘offline’ RL?
thank you :)


From Matt Hoffman to Everyone: (9:37 PM)

  • 
Online RL is what we're going to focus on in this practical, and what we focused on in the lecture yesterday. This is where you are learning at the same time as you explore the environment.
Offline RL is the scenario when you cannot explore. So you might just have a dataset of states/actions/rewards that someone else has generated and given to you.
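
A minimal sketch of the difference, using hypothetical stand-ins (env_reset, env_step, agent_act, agent_update are placeholders, not from the notebook):

    import random

    # Toy placeholders, just to show where the data comes from in each setting.
    def env_reset():
        return 0                                     # initial state

    def env_step(state, action):
        next_state = min(state + action, 3)
        return next_state, 1.0, next_state == 3      # (next_state, reward, done)

    def agent_act(state):
        return random.choice([0, 1])                 # behaviour policy (explores)

    def agent_update(s, a, r, s_next, done):
        pass                                         # learning rule would go here

    # Online RL: the agent explores the environment and learns at the same time.
    state, done = env_reset(), False
    while not done:
        action = agent_act(state)
        next_state, reward, done = env_step(state, action)
        agent_update(state, action, reward, next_state, done)   # learn from fresh experience
        state = next_state

    # Offline RL: no exploration; learn only from a fixed dataset collected by someone else.
    fixed_dataset = [(0, 1, 1.0, 1, False), (1, 1, 1.0, 2, False), (2, 1, 1.0, 3, True)]
    for s, a, r, s_next, d in fixed_dataset:
        agent_update(s, a, r, s_next, d)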


From Me to Everyone: (9:38 PM)

  • 
thank you


From Javier Carnerero Cano to Everyone: (9:47 PM)

  • 
Is there a systematic way to define the rewards?
Ok thank you very much!


From Rian Adam Rajagede, S.Kom., M.Cs. to Everyone: (9:50 PM)

  • 
I'm sorry, can you re-explain what "observations" are? Why is the value [9, 10, 3]?


From Matt Hoffman to Everyone: (9:50 PM)

  • 
Observations are the same as the states we talked about yesterday.


From Bobak Shahriari to Everyone: (9:51 PM)

  • 
Hi Rian! That's not the value of the observation, it's the shape of it.


From Rian Adam Rajagede, S.Kom., M.Cs. to Everyone: (9:51 PM)

  • 
oh.. I see.. thank you Matt and Bobak


From Bobak Shahriari to Everyone: (9:51 PM)

  • 
This is what we mean by "spec" or specification. It's the shape, dtype, and possibly minimum and maximum values.
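
For concreteness, a spec in the dm_env API looks roughly like this (only the (9, 10, 3) shape comes from the discussion above; the dtype, bounds, and the 4-action spec are assumptions):

    import numpy as np
    from dm_env import specs

    # A bounded observation spec: shape, dtype, and optional minimum/maximum values.
    observation_spec = specs.BoundedArray(
        shape=(9, 10, 3),     # [9, 10, 3] above is the shape of the observation, not its value
        dtype=np.float32,     # assumed dtype
        minimum=0.0,
        maximum=1.0,
        name='observation')

    # A typical discrete action spec for a gridworld (assuming 4 movement actions).
    action_spec = specs.DiscreteArray(num_values=4, name='action')

    print(observation_spec.shape, observation_spec.dtype)   # (9, 10, 3) float32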


From Adhi Prahara to Everyone: (9:53 PM)

  • 
Is the 3D environment more difficult to operate in than the 2D environment?


From chris simon to Everyone: (9:54 PM)

  • 
Each episode is solvable, right? I mean, the walls will not close off the path from S to G?


From Bobak Shahriari to Everyone: (9:55 PM)

  • 
Adhi: In principle, the tabular agents work the same in 2D or 3D, but exploration (finding the reward in the first place) can be much more difficult in 3D.

Note that here the environment is not 3D; the 3 layers of 9x10 grids are one way to encode where the goal is, where the agent is, etc.
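
One way to picture that encoding (a sketch only; which plane encodes walls, agent, or goal is an assumption, and the actual notebook encoding may differ):

    import numpy as np

    # Hypothetical encoding of a 9x10 gridworld as three stacked one-hot planes.
    observation = np.zeros((9, 10, 3), dtype=np.float32)

    observation[0, :, 0] = 1.0    # plane 0: wall cells (here, the whole top row)
    observation[4, 2, 1] = 1.0    # plane 1: the agent's position
    observation[7, 8, 2] = 1.0    # plane 2: the goal's position

    assert observation.shape == (9, 10, 3)   # this shape is what the spec reports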


From Adhi Prahara to Everyone: (9:56 PM)

  • 
ah I see thank you bobak


From SAID AL FARABY to Everyone: (9:56 PM)

  • 
Observations = states because this is a fully observable MDP?


From Bobak Shahriari to Everyone: (9:56 PM)

  • 
Chris: Yes. In fact, in this case there is no procedural generation of episodes, to keep this lab session simple.


From Georgios to Everyone: (9:56 PM)

  • 
Can we think of it as a control system with feedback? I.e., take an action, get feedback, and based on the feedback receive a reward or take another action.


From Bobak Shahriari to Everyone: (9:57 PM)

  • 
Said: Exactly, in this case we use them interchangeably because the MDP is fully observed.


From Bobak Shahriari to Everyone: (9:57 PM)

  • 
Georgios: Yes, if you come from a control-theoretic background, this may be a more intuitive way to think of it :)


From Georgios to Everyone: (9:58 PM)

  • 
Thank you


From yusril maulidan to Everyone: (10:03 PM)

  • 
true :)


From chris simon to Everyone: (10:08 PM)

  • 
What is the stopping criterion for this loop?


From Javier Carnerero Cano to Everyone: (10:10 PM)

  • 
Side question, following up on my previous one: in more challenging environments, e.g. self-driving cars, where there are tons of possible rewards depending on the state and action, how do they define the rewards?
Great! Thanks


From Riyad Febrian to Everyone: (10:15 PM)

  • 
If we're working with a physical robot and want to implement RL, can we move the training phase into a virtual environment? I think it's a bad idea to train RL in the real world due to the extra time needed while the agent learns to move the robot, and if the robot hits a wall or collides with something it will probably break the hardware.


From Wawan Cenggoro to Everyone: (10:16 PM)

  • 
Do current self-driving cars use a reinforcement-learning-based model? I think they are just a combination of supervised-learning-based models.


From Georgios to Everyone: (10:17 PM)

  • 
@Riyad In robotics we always simulate first before applying anything on the actual robot.


From Bobak Shahriari to Everyone: (10:17 PM)

  • 
Riyad: Yes indeed! This is a big question in RL, often called sim2real, as it asks: if we learn to behave optimally in a simulator, can we transfer this policy to the real world? In practice this is actually very challenging and an interesting open research question.


From Wickens to Everyone: (10:19 PM)

  • 
Apologies if this is a bit of a random/unrelated question, but I was just curious whether some areas of RL research have some intersection with quantum computing research? The fact that the system has a probabilistic state and you want to evolve it in time to future states kind of reminds me of quantum wavefunctions...


From Bobak Shahriari to Everyone: (10:19 PM)

  • 
:)


From Riyad Febrian to Everyone: (10:21 PM)

  • 
Thanks @Georgios @Bobak.


From Wawan Cenggoro to Everyone: (10:22 PM)

  • 
What if we generate an environment that is similar to the real world with something like a GAN?
yes


From Georgios to Everyone: (10:27 PM)

  • 
Is there a methodology (in loose terms) for designing/building a reinforcement learning system?


From Bobak Shahriari to Everyone: (10:31 PM)

  • 
Georgios: Not sure if this answers your question, but you can perhaps think of all these value-based methods as following one such "methodology" whereby they are designed to propagate the observed rewards to the state-action pairs that produced them.

There are many ways to do this, and SARSA and Q-learning are a couple that we are seeing here.
Later Matt will present a different such "methodology" that is focussed on a policy rather than the values.
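
For reference, the two tabular updates mentioned above differ only in how they bootstrap; a minimal sketch (the table size, alpha, and gamma are illustrative):

    import numpy as np

    n_states, n_actions = 90, 4          # illustrative sizes
    Q = np.zeros((n_states, n_actions))  # action-value table
    alpha, gamma = 0.1, 0.99             # step size and discount

    def sarsa_update(s, a, r, s_next, a_next):
        """On-policy: bootstrap with the action the agent actually takes next."""
        td_target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (td_target - Q[s, a])

    def q_learning_update(s, a, r, s_next):
        """Off-policy: bootstrap with the greedy action in the next state."""
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])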


From Georgios to Everyone: (10:32 PM)

  • 
Thank you


From Bobak Shahriari to Everyone: (10:33 PM)

  • 
And finally the Actor-critic family of methods combines both of these ideas. Sadly we will not be covering any today but see the Acme implementations of Impala/DDPG/D4PG if you want to explore actor-critic algorithms. :)


From chris simon to Everyone: (10:35 PM)

  • 
Is there any way we can expand the observation/action space of a network trained with a smaller observation/action space? Maybe something similar to continual learning.


From Rian Adam Rajagede, S.Kom., M.Cs. to Everyone: (10:35 PM)

  • 
Where can I see how behaviour_policy() is implemented? Is it different from epsilon-greedy?


From Bobak Shahriari to Everyone: (10:41 PM)

  • 
Chris: This is a very interesting research question. Indeed, one can imagine fine-tuning a learned model on a similar task with an additional sensory signal. However, being able to incorporate new sensory signals while not catastrophically forgetting what has already been learned is very delicate.


From Wawan Cenggoro to Everyone: (10:46 PM)

  • 
Can experience replay cause overfitting to some of the past episodes? Because some episodes might be sampled more than others.


From Riyad Febrian to Everyone: (10:48 PM)

  • 
What is 'prioritized experience replay'? And how does it prioritize the experience?


From Bobak Shahriari to Everyone: (10:49 PM)

  • 
Wawan: Very good catch! Indeed, this is one of the concerns when using a replay. While it can be more efficient (less interaction with the environment), it also increases the bias toward your previous experience.


From Bobak Shahriari to Everyone: (10:50 PM)

  • 
Riyad: There are many ways you can prioritize replay, but one possible example is to use the TD error, effectively sampling experience that "surprises" your value estimator more frequently.
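
A minimal sketch of that idea, with priorities proportional to |TD error| (the exponent and the plain-list storage are illustrative; real implementations are far more efficient and also correct for the sampling bias with importance weights):

    import numpy as np

    replay, priorities = [], []     # transitions and one priority per transition

    def add(transition, td_error, eps=1e-6, exponent=0.6):
        replay.append(transition)
        priorities.append((abs(td_error) + eps) ** exponent)   # "surprising" transitions get higher priority

    def sample(batch_size, rng=np.random.default_rng(0)):
        p = np.asarray(priorities)
        p = p / p.sum()                                         # turn priorities into probabilities
        idx = rng.choice(len(replay), size=batch_size, p=p)
        return [replay[i] for i in idx]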


From Wawan Cenggoro to Everyone: (10:51 PM)

  • 
@Bobak: is there any solution to the problem?


From Bobak Shahriari to Everyone: (10:51 PM)

  • 
Everyone: Please feel free to ask for clarification if my answers don't clarify things!


From Bobak Shahriari to Everyone: (10:54 PM)

  • 
Wawan: The approach we use in Acme is rate limitation (facilitated by the Reverb replay system). This way we fix how many times the same experience can be sampled (on average); it does not "solve" the problem, but it allows us to increase/reduce this quantity to balance sample efficiency vs. overfitting.
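
For reference, this is roughly how such a rate limiter is configured in Reverb (the table name, sizes, and samples-per-insert ratio below are illustrative):

    import reverb

    # A replay table whose rate limiter targets a fixed samples-per-insert ratio,
    # bounding (on average) how often the same experience can be reused.
    table = reverb.Table(
        name='replay',
        sampler=reverb.selectors.Uniform(),
        remover=reverb.selectors.Fifo(),
        max_size=100_000,
        rate_limiter=reverb.rate_limiters.SampleToInsertRatio(
            samples_per_insert=32.0,    # target ratio of samples to inserts
            min_size_to_sample=1_000,   # wait for this much data before sampling starts
            error_buffer=1_000.0))      # allowed slack around the target ratio

    server = reverb.Server(tables=[table])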


From Wawan Cenggoro to Everyone: (10:54 PM)

  • 
I see, thanks for the answer


From Hariyanti Binti Mohd Saleh to Everyone: (10:56 PM)

  • 
Matt, may I know why you set last_loss = 0.0? Thanks


From Riyad Febrian to Everyone: (11:03 PM)

  • 
Sorry, my question is off topic. Does the Acme framework support environment simulation with Unity ML-Agents?


From Bobak Shahriari to Everyone: (11:07 PM)

  • 
Riyad: By construction the Acme framework does not make any assumptions about your environment as long as it can interface with it via the dm_env.Environment API.
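
In other words, any simulator (Unity ML-Agents included) can be plugged in by wrapping it in that interface; a skeletal sketch (the specs and the self._sim calls are placeholders, not a real ML-Agents binding):

    import dm_env
    import numpy as np
    from dm_env import specs

    class SimulatorWrapper(dm_env.Environment):
        """Skeleton of a dm_env.Environment wrapper around an external simulator."""

        def __init__(self, sim):
            self._sim = sim   # e.g. a Unity ML-Agents handle (placeholder)

        def reset(self) -> dm_env.TimeStep:
            observation = self._sim.reset()                       # placeholder call
            return dm_env.restart(observation)

        def step(self, action) -> dm_env.TimeStep:
            observation, reward, done = self._sim.step(action)    # placeholder call
            if done:
                return dm_env.termination(reward=reward, observation=observation)
            return dm_env.transition(reward=reward, observation=observation)

        def observation_spec(self):
            return specs.BoundedArray((9, 10, 3), np.float32, 0.0, 1.0, name='observation')

        def action_spec(self):
            return specs.DiscreteArray(num_values=4, name='action')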


From Wawan Cenggoro to Everyone: (11:15 PM)

  • 
Maybe you can zoom out in the browser.


From Wawan Cenggoro to Everyone: (11:34 PM)

  • 
Does it help to use a pretrained CNN model for reinforcement learning?
Sorry, I mean a pretrained CNN from an image classification task
thanks for the answer


From Me to Everyone: (11:38 PM)

  • 
Can we transfer (both intrinsic and extrinsic) rewards from a similar environment?


From Bobak Shahriari to Everyone: (11:39 PM)

  • 
200 I believe.


From Me to Everyone: (11:42 PM)

  • 
thank you very much. that answers my questions :)


From Bobak Shahriari to Everyone: (11:49 PM)

  • 
Not as much as you, Feryal! :)


From Ade Romadhony to Everyone: (11:50 PM)

  • 
thank you very much Feryal and Matt!


From Me to Everyone: (11:50 PM)

  • 
thank you very much. it’s really perfect!